Preface


The Gene Ontology (GO) is the leading project to organize biological knowledge on genes 
and their products in a formal and consistent way across genomic resources. This has had a 
profound impact at several levels. First, such standardization has made possible the integra- 
tion of multiple resources and sources of knowledge, thereby increasing their discoverabil- 
ity and simplifying their usage. Second, it has greatly facilitated—some might say exceedingly 
so—data mining, aggregate analyses, and other forms of automated knowledge extraction. 
Third, it has led to an increase in the overall quality of the resources by enforcing minimum 
requirements across all of them. 

Even considering these advantages, the rapid adoption of the GO in the community has 
been remarkable. In the 15 years since the publication of its introductory article [1], over 
100,000 scientific articles containing the keyword “Gene Ontology” have been published 
and the rate is still increasing (Google Scholar). 

However, despite this popularity and widespread use, many aspects of the Gene 
Ontology remain poorly understood [2], at times even by experts [3]. For instance, unbe- 
knownst to most users, routine procedures such as GO term enrichment analyses remain 
subject to biases and simplifying assumptions that can lead to spurious conclusions [4]. 

The objective of this book is to provide a practical, self-contained overview of the GO 
for biologists and bioinformaticians. After reading this book, we would like the reader to be 
equipped with the essential knowledge to use the GO and correctly interpret results derived 
from it. In particular, the book will cover the state of the art of how GO annotations are 
made, how they are evaluated, and what sort of analyses can and cannot be done with the 
GO. In the spirit of the Methods in Molecular Biology book series in which it appears, there 
is an emphasis on providing practical guidance and troubleshooting advice. 

The book is intended for a wide scientific audience and makes few assumptions about 
prior knowledge. While the primary target is the nonexpert, we also hope that seasoned GO 
users and contributors will find it informative and useful. Indeed, we are the first to admit 
that working with the GO occasionally brings to mind the aphorism “the more we know, 
the less we understand.” 

The book is structured in six main parts. Part I introduces the reader to the fundamen- 
tal concepts underlying the Gene Ontology project, with primers on ontologies in general 
(Chapter 1), on gene function (Chapter 2), and on the Gene Ontology itself (Chapter 3). 

To become proficient GO users, we need to know where the GO data comes from. Part 
II reviews how the GO annotations are made, be it via manual curation of the primary lit- 
erature (Chapter 4), via computational methods of function inference (Chapter 5), via lit- 
erature text mining (Chapter 6), or via crowdsourcing and other contributions from the 
community (Chapter 7). 

But can we trust these annotations? In Part III, we consider the problem of evaluating 
GO annotations. We first provide an overview of the different approaches, the challenges 
associated with them, but also some successful initiatives (Chapter 8). We then focus on the 
more specific problem of evaluating enzyme function predictions (Chapter 9). Last, we 
reflect on the achievements of the Critical Assessment of protein Function Annotation
(CAFA) community experiment (Chapter 10). 

Having made and validated GO annotations, we proceed in Part IV to use the GO 
resource. We consider the various ways of retrieving GO data (Chapter 11), how to quantify 
the functional similarity of GO terms and genes (Chapter 12), or perform GO enrichment 
analyses (Chapter 13)—all the while avoiding common biases and pitfalls (Chapter 14). 
The part ends with a chapter on visualizing GO data (Chapter 15) as well as a tutorial on 
GO analyses in the programming language Python (Chapter 16). 

Part V covers two advanced topics: annotation extensions, which make it possible to 
express relationships involving multiple terms (Chapter 17), and the evidence code ontol- 
ogy, which provides a more precise and expressive specification of supporting evidence than 
the traditional GO annotation evidence codes (Chapter 18). 

Part VI goes beyond the GO, by considering complementary sources of functional infor- 
mation such as KEGG and Enzyme Commission numbers (Chapter 19), and by considering 
the potential of integrating GO with controlled clinical nomenclatures (Chapter 20). 

The final part concludes the book with a perspective by Suzi Lewis on the past, present, 
and future of the GO (Chapter 21). 


London, UK Christophe Dessimoz 
Zurich, Switzerland Nives Skunca 
Fundamentals


Chapter 1 


Primer on Ontologies 


Janna Hastings 


Abstract 


As molecular biology has increasingly become a data-intensive discipline, ontologies have emerged as an 
essential computational tool to assist in the organisation, description and analysis of data. Ontologies 
describe and classify the entities of interest in a scientific domain in a computationally accessible fashion 
such that algorithms and tools can be developed around them. The technology that underlies ontologies 
has its roots in logic-based artificial intelligence, allowing for sophisticated automated inference and error 
detection. This chapter presents a general introduction to modern computational ontologies as they are 
used in biology. 


Key words Ontology, Knowledge representation, Bioinformatics, Artificial intelligence 


1 Introduction 


Examining aspects of the world to determine the nature of the 
entities that exist and their causal networks is at the heart of many 
scientific endeavours, including the modern biological sciences. 
Advances in technology have made it possible to perform large- 
scale high-throughput experiments, yielding results for thousands 
of genes or gene products in single experiments. The data from 
these experiments are growing in public repositories [1], and in 
many cases the bottleneck has moved from the generation of these 
data to the analysis thereof [2]. In addition to the sheer volume of
data, as the focus has moved to the investigation of systems as a 
whole and their perturbations [3], it has become increasingly nec- 
essary to integrate data from a variety of disparate technologies, 
experiments, labs and even across disciplines. Natural language 
data description is not sufficient to ensure smooth data integration, 
as natural language allows for multiple words to mean the same 
thing, and single words to mean multiple things. There are many 
cases where the meaning of a natural language description is not 
fully unambiguous. Ontologies have emerged as a key technology 
going beyond natural language in addressing these challenges. 


Christophe Dessimoz and Nives Skunca (eds.), The Gene Ontology Handbook, Methods in Molecular Biology, vol. 1446, 
The most successful biological ontology (bio-ontology) is the
Gene Ontology (GO) [4], which is the subject of this volume. 

Ontologies are computational structures that describe the 
entities and relationships of a domain of interest in a structured 
computable format, which allows for their use in multiple applica- 
tions [5, 6]. At the heart of any ontology is a set of entities, also 
called classes, which are arranged into a hierarchy from the general 
to the specific. Additional information may be captured such as 
domain-relevant relationships between entities or even complex 
logical axioms. These entities that are contained in ontologies are 
then available for use as hubs around which data can be organised, 
indexed, aggregated and interpreted, across multiple different services,
databases and applications [7].


2 Elements of Ontologies 


Ontologies consist of several distinct elements, including classes,
metadata, relationships, formats and axioms.


2.1 Classes


The class is the basic unit within an ontology, representing a type
of thing in a domain of interest, for example carboxylic acid, heart,
melanoma and apoptosis. Typically, classes are associated with a
unique identifier within the ontology’s namespace, for example
(respectively) CHEBI:33575, FMA:7088, DOID:1909 and
GO:0006915. Such identifiers are semantics free (they do not contain
a reference to the class name or definition) in order to promote
stability even as scientific knowledge and the accompanying
ontology representation evolve. Ontology providers commit to
maintaining identifiers for the long term, so that if they are used in
annotations or other application contexts the user can rely on their
resolution. In some cases as the ontology evolves, multiple entries
may become merged into one, but in these cases alternate identifiers
are still maintained as secondary identifiers. When a class is
deemed to no longer be needed within the ontology it may be
marked as obsolete, which then indicates that the ID should not be
used in further annotations, although it is preserved for historical
reasons. Obsolete classes may contain metadata pointing to one or
more alternative classes that should be used instead.


2.2 Metadata


Classes are usually associated with annotated textual information— 
metadata. The metadata associated with classes may include any 
associated secondary (alternate) identifiers and flags to indicate 
whether the class has been marked as obsolete. It may also include 
one or more synonyms; for example the synonyms of apoptotic 
process (a class in the GO) include cell suicide, programmed cell 
death and apoptosis. It further may include cross references to that 
class in alternative databases and web resources. For example, many 
Chemical Entities of Biological Interest (ChEBI) [8] entries 
contain cross references to the KEGG resource [9], which represents
those chemicals in the context of the biological pathways they
participate in. Textual comments and examples of intended usage 
may be annotated. It is very important that each class include a 
clear definition, which provides enough information to pinpoint 
the meaning of the class and suggest its appropriate use—suffi- 
ciently distinguishing different classes in an ontology so that a user 
can determine which is the best to use for annotation. The defini- 
tion of apoptosis offered by the Gene Ontology is as follows: 


A programmed cell death process which begins when a cell receives an 
internal (e.g. DNA damage) or external signal (e.g. an extracellular 
death ligand), and proceeds through a series of biochemical events 
(signaling pathway phase) which trigger an execution phase. The exe- 
cution phase is the last step of an apoptotic process, and is typically 
characterized by rounding-up of the cell, retraction of pseudopodes, 
reduction of cellular volume (pyknosis), chromatin condensation, 
nuclear fragmentation (karyorrhexis), plasma membrane blebbing and 
fragmentation of the cell into apoptotic bodies. When the execution 
phase is completed, the cell has died. 
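
The elements described so far (a stable primary identifier, secondary identifiers kept after merges, an obsolete flag with suggested replacements, synonyms, cross references and a definition) can be pictured as a simple record. The following is a minimal sketch in Python, not the data model of any particular ontology tool. The class name OntologyClass, the helper functions and the two EX: identifiers are assumptions made for illustration; only GO:0006915 and its synonyms come from the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OntologyClass:
    """One ontology class together with its metadata."""
    identifier: str                                        # primary, semantics-free ID
    name: str
    definition: str = ""
    synonyms: List[str] = field(default_factory=list)
    alt_ids: List[str] = field(default_factory=list)       # secondary IDs kept after merges
    xrefs: List[str] = field(default_factory=list)         # cross references to other resources
    obsolete: bool = False
    replaced_by: List[str] = field(default_factory=list)   # suggested alternatives if obsolete


def build_index(classes: List[OntologyClass]) -> Dict[str, OntologyClass]:
    """Map every primary and secondary identifier to its class."""
    index: Dict[str, OntologyClass] = {}
    for cls in classes:
        index[cls.identifier] = cls
        for alt in cls.alt_ids:
            index[alt] = cls
    return index


def ids_for_annotation(identifier: str, index: Dict[str, OntologyClass]) -> List[str]:
    """Return the identifier(s) that should be used in new annotations."""
    cls = index[identifier]
    if cls.obsolete:
        return list(cls.replaced_by)   # obsolete: point to the alternatives
    return [cls.identifier]            # secondary IDs resolve to the primary ID


apoptosis = OntologyClass(
    identifier="GO:0006915",
    name="apoptotic process",
    synonyms=["cell suicide", "programmed cell death", "apoptosis"],
    alt_ids=["EX:0000001"],            # hypothetical secondary identifier
)
retired = OntologyClass(
    identifier="EX:0000002",           # hypothetical obsolete class
    name="retired example class",
    obsolete=True,
    replaced_by=["GO:0006915"],
)
INDEX = build_index([apoptosis, retired])
```

With this index, both the hypothetical secondary identifier and the hypothetical obsolete class lead back to GO:0006915, which is the behaviour that lets old annotations keep resolving as the ontology evolves.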


Classes are arranged in a hierarchy from the general (high in the 
hierarchy) to the specific (low in the hierarchy). For example, in 
ChEBI carboxylic acid is classified as a carbon oxoacid, which in turn 
is classified as an oxoacid, which in turn is classified as a hydroxide, 
and so on up to the root chemical entity, which is the most general 
term in the structure-based classification branch of the ontology. 

Despite the hierarchical organisation, most ontologies are not 
simple trees. Rather, they are structured as directed acyclic graphs. 
This is because it is possible for classes to have multiple parents in 
the classification hierarchy, and furthermore ontologies include 
additional types of relationships between entities other than hierar- 
chical classification (which itself is represented by is_a relations). 
All relations are directed and care must be taken by the ontology 
editors to ensure that the overall structure of the ontology does 
not contain cycles, as illustrated in Fig. 1. 
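
The acyclicity requirement can be checked mechanically. The sketch below uses a standard depth-first search to look for a cycle in a graph of directed relations; it is an illustration of the idea rather than the check used by any specific ontology editor. The function name find_cycle and the toy graphs (which mirror panels b and c of Fig. 1) are assumptions made for this example.

```python
def find_cycle(edges):
    """Return a list of nodes forming a cycle, or None if the graph is acyclic.

    `edges` maps each class to the list of classes it points to,
    for example its parents via is_a or part_of.
    """
    WHITE, GREY, BLACK = 0, 1, 2        # unvisited, on the current path, finished
    colour = {}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for target in edges.get(node, []):
            state = colour.get(target, WHITE)
            if state == GREY:           # target is already on the current path: cycle
                return path[path.index(target):] + [target]
            if state == WHITE:
                cycle = visit(target)
                if cycle:
                    return cycle
        path.pop()
        colour[node] = BLACK
        return None

    for node in list(edges):
        if colour.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# A directed acyclic graph in which D has two parents (not a tree).
dag = {"D": ["B", "C"], "B": ["A"], "C": ["A"], "A": []}

# The same kind of graph with an edge that closes a loop.
cyclic = {"A": ["B"], "B": ["C"], "C": ["A"]}
```

Calling find_cycle(dag) returns None, while find_cycle(cyclic) returns the offending path. The recursive form is fine for small examples; a very deep ontology would call for an iterative traversal.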


Fig. 1 (a) A simple hierarchical tree, (b) a directed acyclic graph, (c) a graph that contains a cycle, indicated in red

Table 1
A selection of relationship types commonly used in bio-ontologies

Relationship type | Informal meaning | Examples
part_of | The standard relation of parthood. | A brain is part_of a body.
derives_from | Derivation holds between distinct entities when one succeeds the other across a temporal divide in such a way that a biologically significant portion of the matter of the earlier entity is inherited by the latter. | A zygote derives_from a sperm and an ovum.
has_participant | A relation that links processes to the entities that participate in them. | An apoptotic process has_participant a cell.
has_function | A relation that links material entities to their functions, e.g. the biological functions of macromolecules. | An enzyme has_function to catalyse a specific reaction type.

A common relationship type used in multiple ontologies is
part_of or has_part, representing composition or constitution. For
example, in the Foundational Model of Anatomy (FMA) [10], heart
has_part aortic valve. The Relation Ontology (RO) defines several
relationship types that are commonly used across multiple
bio-ontologies [11], a selection of which is shown in Table 1.

In addition, specific ontologies may also include additional
relationships that are particular to their domain. For example, GO
includes biological process-specific relations such as regulates,
while ChEBI includes chemistry-specific relationships such as
is_tautomer_of and is_enantiomer_of.

The specification for a relationship type in an ontology includes
a unique identifier, name and classification hierarchy, as for classes,
as well as a specification whether the relationship is symmetric (i.e. A
rel B if and only if B rel A) and/or transitive (if A rel B and B rel
C then A rel C), and the name of the inverse relationship type if it
exists. The same metadata as is associated with the classes in the
ontology may also be associated with relationship types: alternative
identifiers, synonyms, a definition and comments, and a flag to
indicate if the relationship is obsolete.


2.4 Formats

Typically, ontologies are stored in files conforming to a specific file 
format, although there are exceptions that are stored in custom- 
built infrastructures. Ontologies may be represented in different 
underlying ontology languages, and historically there has been an 
evolution of the capability of ontology languages towards greater 
logical expressivity and complexity, which is mirrored by the 
advances in computational capacity (hardware) and tools. Biological 
ontologies such as the GO have historically been represented in the 
human-readable Open Biomedical Ontologies (OBO) language (footnote 1),
which was designed specifically for the structure and metadata content
associated with bio-ontologies, but in recent years there has
been a move towards the Semantic Web standard Web Ontology
Language (OWL) (footnote 2), largely due to the latter's adoption within a
wider community and expansive tool support. Within OWL, spe- 
cific standardised annotations are used to encode the metadata 
content of bio-ontologies as OWL annotations. However, the dis- 
tinction has become cosmetic to some extent, as tools have been 
created which are able to interconvert between these languages 
[12], provided that certain constraints are adhered to. 
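
To give a feel for why the OBO language is described as human-readable, the sketch below parses a small stanza of tag-value lines in the general style of OBO files. It is a simplified illustration and not a complete or validating OBO parser. The function names are assumptions made for this example, and the stanza is hand-written from the identifier and synonyms mentioned earlier in this chapter; in particular, the synonym scopes (BROAD, EXACT) are illustrative and are not copied from the actual GO file.

```python
import re

# Hand-written stanza; synonym scopes are illustrative assumptions.
OBO_TEXT = '''\
[Term]
id: GO:0006915
name: apoptotic process
synonym: "cell suicide" BROAD []
synonym: "programmed cell death" BROAD []
synonym: "apoptosis" EXACT []
is_obsolete: false
'''


def parse_stanzas(text):
    """Parse OBO-style text into a list of (stanza_type, {tag: [values]}) pairs."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.split(" ! ")[0].strip()      # drop a trailing "! comment"
        if not line:
            continue
        header = re.fullmatch(r"\[(\w+)\]", line)
        if header:                               # a new stanza such as [Term]
            current = (header.group(1), {})
            stanzas.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)     # split on the first ": " only
            current[1].setdefault(tag, []).append(value)
    return stanzas


def synonym_text(value):
    """Extract the quoted text from a synonym value."""
    match = re.match(r'"([^"]*)"', value)
    return match.group(1) if match else value


STANZAS = parse_stanzas(OBO_TEXT)
```

Each tag may occur several times in a stanza, which is why the values are collected into lists. For real work, an established library or converter is a safer choice than a hand-rolled parser like this one.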


Within logic-based languages such as OWL, statements in ontolo- 
gies have a definite logical meaning within a set-based logical the- 
ory. Classes have instances as members, and logical axioms define 
constraints on class definitions that apply to all class members. For 
example, the statement carboxylic acid is a carbon oxoacid has the 
logical meaning that all instances of carboxylic acid are also 
instances of carbon oxoacid: 


\[ \forall x : \mathrm{CarboxylicAcid}(x) \rightarrow \mathrm{CarbonOxoacid}(x) \]


The logical languages underlying ontology technology are collec- 
tively called Description Logics [13]—in the plural because there 
are different variants with different levels of complexity. Some of 
the different ingredients of logical axioms that are available in the 
OWL language—quantification, cardinality, logical connectives 
and negation, disjointness and class equivalence—are explained in 
Table 2. 

Like the carboxylic acid example above, each of these axiom 
types can be expressed as a logical statement. With these axioms, 
logic-based ontology reasoners are able to check for errors in an 
ontology. For example, if a class relation is quantified with ‘only’ 
such as the hydrocarbon example given in the table, which in logi- 
cal language means 


\[ \forall x \, \forall y : \mathrm{Hydrocarbon}(x) \land \mathrm{hasPart}(x, y) \rightarrow \mathrm{Hydrogen}(y) \lor \mathrm{Carbon}(y) \]


and then if a subclass of hydrocarbon in the ontology has a
has_part relation with a target other than a hydrogen or a carbon
(e.g. an oxygen):


\[ \mathrm{Hydrocarbon}(a) \land \mathrm{hasPart}(a, b) \land \mathrm{Oxygen}(b) \]


that class will be detected as inconsistent and flagged as such by the 
reasoner. 
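
The kind of check described here can be imitated on a toy scale. The sketch below is not a Description Logic reasoner and does not use HermiT or any OWL library; it simply walks a small hand-written hierarchy and reports has_part targets that fall outside an 'only' restriction inherited from a superclass. All names and data are assumptions made for illustration, and the class methanol is deliberately misclassified as a hydrocarbon to trigger the error. A real reasoner would additionally need to know that oxygen is disjoint from hydrogen and carbon, which this sketch takes for granted by comparing names.

```python
# Toy is_a hierarchy: child -> parent. "methanol" is misclassified on purpose.
IS_A = {"methane": "hydrocarbon", "methanol": "hydrocarbon"}

# Universal restriction: hydrocarbon has_part only (hydrogen or carbon).
ONLY_HAS_PART = {"hydrocarbon": {"hydrogen", "carbon"}}

# Asserted has_part targets for each class.
HAS_PART = {
    "methane": {"carbon", "hydrogen"},
    "methanol": {"carbon", "hydrogen", "oxygen"},
}


def superclasses(cls, is_a):
    """Yield cls and every class above it in the is_a hierarchy."""
    while cls is not None:
        yield cls
        cls = is_a.get(cls)


def find_violations(is_a, only_has_part, has_part):
    """List (class, restricting superclass, offending target) triples."""
    found = []
    for cls in sorted(has_part):
        for ancestor in superclasses(cls, is_a):
            allowed = only_has_part.get(ancestor)
            if allowed is None:
                continue                 # no 'only' restriction at this level
            for target in sorted(has_part[cls] - allowed):
                found.append((cls, ancestor, target))
    return found


VIOLATIONS = find_violations(IS_A, ONLY_HAS_PART, HAS_PART)
```

Here the only violation reported is the oxygen part of the misclassified class, mirroring the inconsistency that a reasoner would flag.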


Footnote 1: http://www.cs.man.ac.uk/~horrocks/obo/
Footnote 2: http://www.w3.org/TR/owl2-overview/
Table 2
Logical constructs available in the OWL language

Language component | Informal meaning | Examples
Quantification: universal (only) or existential (some) | When specifying relationships between classes, it is necessary to specify a constraint on how the relationship should be interpreted: universal quantification means that for all relationships of that type the target has to belong to the specified class, while existential quantification means that at least one member of the target class must participate in a relationship of that type. | molecule has_part some atom; hydrocarbon has_part only (hydrogen or carbon)
Cardinality: exact, minimum or maximum | It is possible to specify the number of relationships with a given type and target that a class must participate in, or a minimum or maximum number thereof. | human has_part exactly 2 leg
Logical connectives: intersection (and) or union (or) | It is possible to build complex expressions by joining together parts using the standard logical connectives and, or. | vitamin B equivalentTo (thiamin or riboflavin or niacin or pantothenic acid or pyridoxine or folic acid or vitamin B12)
Negation (not) | In addition to building complex expressions using the logical connectives, it is possible to compose negations. | tailless equivalentTo not (has_part some tail)
Disjointness of classes | It is possible to specify that classes should not share any members. | organic disjointFrom inorganic
Equivalence of classes | It is possible to specify that two classes, or class expressions, are logically equivalent, and that they must by definition thus share all their members. | melanoma equivalentTo (skin cancer and develops_from some melanocyte)


The end result, an ontology which combines terminological
knowledge with complex domain knowledge captured in logical
form, is thus amenable to various sophisticated tools which are
able to use the captured knowledge to check for errors, derive
inferences and support analyses.


3 Tools


Developing a complex computational knowledge base such as a 
bio-ontology (for example, the Gene Ontology includes 43,980 
classes) requires tool support at multiple levels to assist the human 
knowledge engineers (curators) with their monumental task. For 
editing ontologies, a commonly used freely available platform is 
Protégé [14]. Protégé allows the editing of all aspects of an 
ontology including classes and relationships, logical axioms (in the
OWL language) and metadata. Protégé furthermore includes built- 
in support for the execution of automated reasoners to check for 
logical errors and for ontology visualisation using various different 
algorithms. Examples of reasoners that can be used within Protégé 
are HermiT [15] and FaCT++ [16]. For the rapid editing and construction
of ontologies, various utilities are available, such as the
creation of a large number of classes in a single ‘wizard’ step. The 
software is open source and has a pluggable architecture, which 
allows for custom modular extensions. Protégé is able to open both 
OBO and OWL files, but it is designed primarily for the OWL lan- 
guage. An alternative editor specific to the OBO language is OBO- 
Edit [17]. Relative to Protégé, OBO-Edit offers more sophisticated 
metadata searching and a more intuitive user interface. 

To browse, search and navigate within a wide variety of bio- 
ontologies without installing any software or downloading any 
files, the BioPortal web platform provides an indispensable resource 
[18] that is especially important when using terminology from 
multiple ontologies. Additional browsing interfaces for multiple 
ontologies include the OLS [19] and OntoBee [20]. Most ontolo- 
gies are also supported by one or more browsing interfaces specific 
to that single ontology, and for the Gene Ontology the most commonly
used interfaces are AmiGO [21] and QuickGO [22].

Large-scale ontologies such as the GO and ChEBI are often 
additionally supported by custom-built software tailored to their 
specific use case, for example embedding the capability to create 
species-specific ‘slims’ (subsets of terms of the greatest interest 
within the ontology for a specific scenario) for the GO, or chemin- 
formatics support for ChEBI. As ontologies are shared across com- 
munities of users, an important part of the tool support profile is 
tools for the community to provide feedback and to submit addi- 
tional entries to the ontology. 


The purposes that are supported by modern bio-ontologies are 
diverse. The most straightforward application of ontologies is to 
support the structured annotation of data in a database. Here, 
ontologies are used to provide unique, stable identifiers—associ- 
ated to a controlled vocabulary—around which experimental data 
or manually captured reference information can be gathered [23]. 
An ontology annotation links a database entry or experimental 
result to an ontology class identifier, which, being independent of 
the single database or resource being annotated, is able to be 
shared across multiple contexts. Without such shared identifiers for
biological entities, discrepant ways of referring to entities tend to
accumulate (different key words, synonyms, or variants of
identifying labels), which significantly hinders reuse and integration
of the relevant data in different contexts.

Secondly, ontologies can serve as a rich source of vocabulary 
for a domain of interest, providing a dictionary of names, syn- 
onyms and interrelationships, thereby facilitating text mining (the 
automated discovery of knowledge from text) [24], intelligent 
searching (such as automatic query expansion and synonym searching;
an example is described in [25]) and unambiguous identifica-
tion. When used in multiple independent contexts, such a common 
vocabulary can become additionally powerful. For example, unit- 
ing the representation of biological entities across different model 
organisms allows common annotations to be aggregated across 
species [26], which facilitates the translation of results from one 
organism into another in a fashion essential for the modern accu- 
mulation of knowledge in molecular biology. The use of a shared 
ontology also allows the comparison and translation of entities from
one discipline to another such as between biology and chemistry 
[27], enabling interdisciplinary tools that would be impossible 
computationally without a unified reference vocabulary. 

While the above applications would be possible even if ontol- 
ogies consisted only of controlled vocabularies (standardised sets 
of vocabulary terms), the real power of ontologies comes with 
their hierarchical organisation and use of formal inter-entity rela- 
tionships. Through the hierarchy of the ontology, it is possible to 
annotate data to the most specific applicable term but then to 
examine large-scale data in aggregate for patterns at the higher 
level categories. By centralising the hierarchical organisation in 
an application-independent ontology, different sources of data 
can be aggregated to converge as evidence for the same class-level 
inferences, and complex statistical tools can be built around 
knowledge bases of ontologies combined with their annotations, 
which check for over-representation or under-representation of 
given classes in the context of a given dataset relative to the back- 
ground of everything that is known [28] (for more information 
see Chap. 13 [29]). The knowledge-based relationships captured 
in the ontology can be used to assign quantitative measures of 
similarity between entities that would otherwise lack a quantifi- 
able comparative metric [30] (for more information see Chap. 12 
[31]). And the relationships between entities can be used to
power sophisticated knowledge-based reasoning, such as inferring
the anatomical contexts to which organs, tissues and cells belong
[32].
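
The aggregation described in this paragraph can be made concrete with a small worked example. The sketch below first propagates each gene's most specific annotations up a toy is_a hierarchy, and then computes a one-sided hypergeometric p-value for over-representation of a term in a study set relative to a background population. The hypergeometric test is one common choice and is used here as an assumption; it is not necessarily the method of the tools cited above, and it does not address the biases discussed in later chapters. All identifiers, gene names and function names are hypothetical.

```python
from math import comb

# Toy is_a hierarchy (child -> parents); identifiers are hypothetical.
PARENTS = {
    "EX:apoptosis": ["EX:cell_death"],
    "EX:necrosis": ["EX:cell_death"],
    "EX:cell_death": ["EX:process"],
    "EX:process": [],
}

# Each gene is annotated to its most specific applicable term(s).
ANNOTATIONS = {
    "geneA": {"EX:apoptosis"},
    "geneB": {"EX:necrosis"},
    "geneC": {"EX:apoptosis"},
    "geneD": {"EX:process"},
}


def ancestors(term, parents):
    """Return the term itself plus everything reachable by following is_a."""
    seen = set()
    todo = [term]
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.add(current)
            todo.extend(parents.get(current, []))
    return seen


def propagate(annotations, parents):
    """Extend every gene's annotations to all ancestor terms."""
    return {
        gene: set().union(*(ancestors(t, parents) for t in terms))
        for gene, terms in annotations.items()
    }


def enrichment_p_value(study, population, term, propagated):
    """One-sided hypergeometric p-value; `study` must be a subset of `population`."""
    N = len(population)                                  # background size
    n = len(study)                                       # study set size
    K = sum(term in propagated[g] for g in population)   # annotated in background
    k = sum(term in propagated[g] for g in study)        # annotated in study set
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


PROPAGATED = propagate(ANNOTATIONS, PARENTS)
P_VALUE = enrichment_p_value(
    {"geneA", "geneB"}, set(ANNOTATIONS), "EX:cell_death", PROPAGATED
)
```

In this toy data no gene is annotated directly to EX:cell_death, yet after propagation three of the four genes count towards it. That is what allows patterns to be examined at the higher-level categories. With both study genes annotated out of a background of four, the p-value is 0.5, which is unsurprising for such a small example.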

With all these applications in mind, it is no wonder that the 
number and scope of bio-ontologies have been proliferating over 
the last decades. The OBO Foundry is a community organisation 
that offers a web portal in which participating ontologies are listed 
[33]. The web portal currently lists 137 ontologies, excluding
obsolete records. Each of these ontologies has biological relevance
and has agreed to abide by several community principles, including
providing the ontology under an open license. Examples of these
ontologies include ChEBI, the FMA, the Disease Ontology [34]
and of course the Gene Ontology which is the topic of this book.
In the context of the OBO Foundry, different ontologies are now
becoming interrelated through inter-ontology relationships [35],
and where there are overlaps in content they are being resolved
through community workshops.


5 Limitations


Ontologies are a powerful technology for encoding domain knowledge
in computable form in order to drive a multitude of different
applications. However, they are not one-stop solutions for all
knowledge representation requirements. There are certain limitations
to the type of knowledge they can encode and the ways that
applications can make use of that encoded knowledge.

Firstly, it is important to bear in mind that ontologies are based
on logic. They are good at representing statements that are either
true or false (categorical), but they cannot elegantly represent
knowledge that is vague, statistical or conditional [36]. Classes
that derive their meaning from comparison to a dynamic or conditional
group (e.g. the shortest person in the room, which may vary
widely) are also not possible to represent well within ontologies. It
can be difficult to adequately capture knowledge about change
over time at the class level, i.e. classes in which the members
participate in relationships at one time and not at another, as including
a temporal index for each relation would require ternary relations,
which neither the OBO nor the OWL language supports.

Furthermore, although the underlying technology for representation
and automated reasoning has advanced a lot in recent
years, there are still pragmatic limits to ensure the scalability of the
reasoning tools. For this reason, higher order logical statements,
non-binary relationships and other complex logical constructs cannot
yet be represented and reasoned with in most of the modern
ontology languages.


Acknowledgements


The author was supported by the European Molecular Biology 
Laboratory (EMBL). Open Access charges were funded by the 
University College London Library, the Swiss Institute of
Bioinformatics, the Agassiz Foundation, and the Foundation for 
the University of Lausanne. 
Chapter 2


The Gene Ontology and the Meaning of Biological Function 


Paul D. Thomas 


Abstract 


The Gene Ontology (GO) provides a framework and set of concepts for describing the functions of gene 
products from all organisms. It is specifically designed for supporting the computational representation of 
biological systems. A GO annotation is an association between a specific gene product and a GO concept, 
together making a statement pertinent to the function of that gene. However, the meaning of the term 
“function” is not as straightforward as it might seem, and has been discussed at length in both philosophi- 
cal and biological circles. Here, I first review these discussions. I then present an explicit formulation of the 
biological model that underlies the GO and annotations, and discuss how this model relates to the broader 
debates on the meaning of biological function. 


Key words Genome, Function, Ontology, Selected effects, Causal role 


1 What Is Biological Function? 


The notion of function in biology has received a great deal of 
attention in the philosophical literature. At the broadest level, 
there are two schools of thought on how functions should be 
defined, now most commonly referred to as “causal role function” 
and “selected effect function.” Causal role function was first pro- 
posed by Cummins [1], and it focuses on describing function in 
terms of how a part contributes to some overall capacity of the 
system that contains the part. In this formulation, the function of 
an entity is relative to some system to which it contributes. For 
example, the statement “the function of the heart is to pump 
blood” has meaning only in the context of the larger circulatory 
system’s capacity to deliver nutrients and remove waste products 
from bodily tissues. However, one of the main objections to the 
causal role definition of function is that there is no systematic way 
to identify what the larger system (and the relevant capacity of that 
system) should be. Selected effect function, on the other hand, 
derives from the “etiological” definition of function first proposed 
by Wright [2]. In this formulation, a function of an entity is the 
ultimate answer to the question of why the entity exists at all. In
biology, as explained by Millikan [3] and Neander [4], this is tan- 
tamount to asking the following: For which of its effects was it 
selected during evolution? One obvious advantage of the selected 
effect definition is that it explicitly incorporates evolutionary con- 
siderations, and demands that a function ultimately derive from its 
history of natural selection. On the more practical side, it has the 
further advantage of putting constraints on which effects, out of 
the myriad causal effects that a particular entity might have, could 
be considered as functions. Following the example above, an effect 
of the heart (beating) is to produce a sound, but it would not be 
correct to say that the function of the heart is to produce a sound. 
The selected effects definition of function would distinguish a 
proper function (e.g., pumping blood) from an “accidental” effect 
(e.g., producing a sound) on the basis that natural selection more 
likely operated on the heart’s effect of pumping blood. In the 
causal role definition, on the other hand, there is always the poten- 
tial for arbitrariness and idiosyncrasy in defining a containing sys- 
tem and capacities; thus there is no general rule for distinguishing 
functional from accidental effects. 

Nevertheless, causal role function has been stalwartly defended 
by biologists in the subdiscipline of functional anatomy [5], which 
emphasizes how anatomical parts function as parts of larger sys- 
tems. They claim that the selected trait can be difficult to infer, and 
lack of a hypothesis for such a trait should not stand in the way of 
an analysis of the mechanism of how an anatomical feature oper- 
ates. For example, one could analyze a jaw in terms of its capacity 
for generating a crushing force irrespective of whether it was 
selected for crushing seeds or defending against a predator. Indeed, 
the search for mechanisms of operation, or more generally just 
“mechanism,” has more recently been offered as an alternative
paradigm for molecular and neurobiology in particular [6]. Mechanism,
like causal role, focuses on how parts contribute to a system. But it 
takes a step further in defining core concepts, and how these relate 
to function. The core concepts are entities and activities: physical 
entities (such as proteins) perform activities, or actions that can 
have causal effects on other activities. In this view, a function is 
simply an activity that is carried out as part of a larger mechanism. 
For example, the function of the ribosome (an entity) is translation 
(an activity), and translation plays a role in a larger mechanism of 
gene expression. The subtle difference from earlier formulations of 
function is an emphasis on the activity having the role of a function, 
rather than the entity itself having a function. Also like causal role, 
no a priori constraints are put on mechanism: “a function is ... a 
component in some mechanism, that is ... in a context that is taken 
to be important, vital, or otherwise significant.” Clearly mechanism 
is susceptible to the same criticism as causal role function, regard- 
ing arbitrariness in the choice of system. 

The core differences between selected effect function and causal
role function derive largely from differences in what question they 
are trying to answer. For selected effect function, the question is 
about origins: Why is the entity there (i.e., what explains its selective 
advantage)? [2]. For causal role function, the question is about 
operation: How does the entity contribute to the biological capaci- 
ties of the organism that has the entity (and only secondarily, how do 
those capacities relate to natural selection)? [1]. And there is little 
doubt that in most biological research endeavors today, the concern 
is in elucidating the mechanisms by which biological systems oper- 
ate, rather than in explaining why the parts are there to begin with. 

The notion of function, particularly in connection with molec- 
ular biology, has been discussed at length not only by philosophers, 
but also by molecular biologists themselves. As a representative 
sample, I will consider two publications written with very different 
aims in mind: a textbook chapter by Alberts entitled “Protein 
Function” [7] and a philosophical treatise by Monod, Chance and 
Necessity [8]. Alberts’ treatment of “function” covers two distinct 
but related senses of the word. The first is how an individual pro- 
tein works at the mechanistic level (its manner of functioning): 
“how proteins bind to other selected molecules and how their 
activity depends on such binding.” The second is to describe how 
a protein acts as a component in a larger system, by analogy to 
mechanical parts in human-designed systems (its functional role in 
the context of the operation of the cell): “proteins ... act as cata- 
lysts, signal receptors, switches, motors, or tiny pumps.” Specific 
molecular binding can be considered the general mechanism by 
which a functional role can be carried out. These uses of “func- 
tion” appear, at least on the face of it, to be more in line with the 
causal role and mechanism views in the philosophical literature. 

Given its broader intended audience of scientists and laymen 
(and presumably philosophers), Chance and Necessity puts biologi- 
cal function in a much broader context. Monod coins the term 
“teleonomic function” to describe more precisely what he means 
by function. He carefully defines teleonomy as the characteristic of 
“objects endowed with a purpose or project, which at the same 
time they exhibit through their structure and carry out through 
their performances” [p. 9]. Teleonomy is also a property of human- 
designed “artifacts,” further emphasizing the view of function in 
terms of an apparent purpose in accomplishing a predetermined 
aim. But living systems owe their teleonomy to a distinct source. 
As he so eloquently (if also compactly) states, “invariance necessar- 
ily precedes teleonomy” [p. 23], which he goes on to explain fur- 
ther as “the Darwinian idea that the initial appearance, evolution 
and steady refinement of ever more intensely teleonomic structures 
are due to perturbations in a structure which already possesses the 
property of invariance.” Thus what appears to be a future-goal- 
oriented action by a living organism is, in fact, only a blind 
repetition of a genetic program that evolved in the past. Importantly,
Monod notes the presence of teleonomy at all levels of a biological 
system, from proteins (which he calls “the essential molecular 
agents of teleonomic performance”) to “systems providing large 
scale coordination of the organism’s performances ... [such as] the
endocrine and nervous systems” [p. 62]. In this way, Monod’s
teleonomic function includes aspects of both Wright's selected 
effect function (the origin of apparently designed functions in prior 
natural selection) and Cummins's causal role function (the role of 
a part in a larger system). 

In summary, function as conceived by molecular biologists (in 
what could be called the “molecular biology paradigm”) refers to
specific, coordinated activities that have the appearance of having 
been designed for a purpose. That apparent purpose is their func- 
tion. The appearance of design derives from natural selection, so 
many biologists now favor the use of the term “biological pro- 
gram” to avoid connotations of intentional design. Following this 
convention, biological programs, when executed, perform a func- 
tion; that is, they result in a particular, previously selected outcome 
or causal effect. Biological programs are nested modularly inside 
other, larger biological programs, so a protein can be said to have 
functions at multiple levels. The lowest level biological program is 
expression of a single macromolecule, e.g., a protein: the gene is 
transcribed into RNA, which is translated into a protein, which 
adopts a particular structure that performs its function simply by 
following physical laws that determine how it will interact with 
a specific (i.e., small) number of distinct types of other
molecular entities. At higher levels, the functions of multiple
proteins are executed in a coherent, controlled (“regulated”) manner
to accomplish a larger function. Thus, simply identifying a coher- 
ent, regulated system of activities can be a fruitful, practical start 
for identifying selected effect functions. Causal role analyses can 
and do play such a role in functional anatomy and molecular biol- 
ogy. But of course they are only candidates for evolved biological 
functions until they have been related to past survival and repro- 
duction, the ultimate function of every biological program. 


2 Function in the Gene Ontology 


I now turn to a description of how function is conceived of, and
represented in practice, in the Gene Ontology.


2.1 Gene Products, Not Genes, Have Functions


In order to understand how gene function is represented in the
GO, some basic molecular biology knowledge is required.


- A gene is a contiguous region of DNA that encodes instructions
  for how the cell can make a large (“macro”) molecule (or
  potentially multiple different macromolecules).


- A macromolecule is called a gene product (as it is produced
  deterministically according to the instructions from a gene),
  and can be of two types, a protein (the most common type) or
  a noncoding RNA.


- A gene product can act as a molecular machine; that is, it can
  perform a chemical action that we call an activity.


- Gene products from different genes can combine into a larger
  molecular machine, called a macromolecular complex.


Each concept in the Gene Ontology relates to the activity of a
gene product or complex, as these are the entities that carry out
cellular processes. A gene encodes a gene product, so it can
obviously be considered the ultimate source of these activities and
processes. But strictly speaking, a gene does not perform an activity
itself. Thus, when the Gene Ontology refers to “gene function,” it
is actually shorthand for “gene product function.”


2.2 Assertions About Functions of Particular Genes Are Made by “GO Annotations”


The Gene Ontology defines the “universe” of possible functions a
gene might have, but it makes no claims about the function of any
particular gene. Those claims are, instead, captured as “GO
annotations.” A GO annotation is a statement about the function of a
particular gene. But our biological knowledge is extremely
incomplete. Accordingly, the GO annotation format is designed to
capture partial, incomplete statements about gene function. A GO
annotation typically associates only a single GO concept with a
single gene. Together, these statements comprise a “snapshot” of
current biological knowledge. Different pieces of knowledge
regarding gene function may be established to different degrees,
which is why each GO annotation always refers to the evidence
upon which it is based.


2.3 The Model of Gene Function Underlying the GO


The Gene Ontology (GO) considers three distinct aspects of
how gene functions can be described: molecular function, cellu- 
lar component, and biological process (note that throughout 
this chapter, bold text will denote specific concepts, or classes, 
from the Gene Ontology). In order to understand what these 
aspects mean and how they relate to each other, it may be helpful 
to consider the biological model assumed in GO annotations. GO 
follows what could be called the “molecular biology paradigm,” as
described in the previous section. In this representation, a gene 
encodes a gene product, and that gene product carries out a 
molecular-level process or activity (molecular function) in a spe- 
cific location relative to the cell (cellular component), and this 
molecular process contributes to a larger biological objective (bio- 
logical process) comprised of multiple molecular-level processes. 
An example, elaborating on the example in the original GO paper 
[9], is shown in Fig. 1. 

[Figure 1: diagram of DNA-directed DNA replication in yeast, showing the gene products and complexes MCM2-7, PRI1-2, RFC2-5, POL3/POL31/POL32, and CDC9]

Fig. 1 DNA replication (in yeast) as modeled using the GO. Gene products/complexes (white) perform
molecular processes (molecular function, red) in specific locations (cellular component, yellow), as part of
larger biological objectives (biological process, specifically DNA-directed DNA replication)
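
The model in Fig. 1 can be made concrete with a small data structure. The following Python sketch is illustrative only: the gene products, complexes, and the biological process name are taken from Fig. 1, whereas the class name `Activity`, the helper `participants`, and all molecular function and cellular component labels are assumptions introduced here and are not read off the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """One gene product (or complex) acting within the GO model."""
    gene_product: str        # gene product or complex that acts
    molecular_function: str  # molecular-level process it carries out
    cellular_component: str  # location, relative to the cell, of that activity
    biological_process: str  # larger biological objective it contributes to


PROCESS = "DNA-directed DNA replication"  # process named in Fig. 1

# Gene products/complexes are from Fig. 1. The molecular function and
# cellular component labels below are ASSUMED for illustration only.
REPLICATION = [
    Activity("complex(MCM2-7)", "DNA helicase activity", "replication fork", PROCESS),
    Activity("complex(PRI1-2)", "DNA primase activity", "replication fork", PROCESS),
    Activity("complex(RFC2-5)", "DNA clamp loader activity", "replication fork", PROCESS),
    Activity("complex(POL3,POL31,POL32)", "DNA polymerase activity", "replication fork", PROCESS),
    Activity("CDC9", "DNA ligase activity", "replication fork", PROCESS),
]


def participants(activities, process):
    """List the gene products whose activity is part of the given process."""
    return [a.gene_product for a in activities if a.biological_process == process]


if __name__ == "__main__":
    for name in participants(REPLICATION, PROCESS):
        print(name)
```

Each record ties one gene product to all three GO aspects at once, which is the sense in which the three aspects describe a single activity rather than three unrelated properties.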


To reiterate, GO concepts were designed to apply specifically
to the actions of gene products, i.e., macromolecular machines
comprising proteins, RNAs, and stable complexes thereof. In the
GO representation, a region of DNA (e.g., a regulatory region) is
treated not as carrying out a molecular process, but rather as an
object that gene products can act upon in order to perform their
specific activities.


2.4 Molecular Functions Define Molecular Processes (Activities)


In the GO, a molecular function is a process that can be carried
out by the action of a single macromolecular machine, via direct
physical interactions with other molecular entities. Function in this
sense denotes an action, or activity, that a gene product performs.
These actions are described from the two distinct but related
perspectives commonly employed by biologists: (1) biochemical
activity, and (2) role as a component in a larger system/process.
Biochemical activities include binding and catalytic activities, and
are only functions in the broad sense, i.e., how something
functions, the molecular mechanism of operation. Component role
descriptions, on the other hand, refer to roles in larger processes,
and are sometimes described by analogy to a mechanical or
electrical system. For example, biologists may refer to a protein that
functions (acts) as a receptor. This is because the activity is
interpreted as receiving a signal, and converting that signal into
another physicochemical form. Unlike biochemical activities, these
roles require some degree of interpretation that includes knowledge
of the larger system context in which the gene product acts.


2.5 Cellular Components Define Places Where Molecular Processes Occur


A cellular component is a location, relative to cellular
compartments and structures, occupied by a macromolecular machine
when it carries out a molecular function. There are two ways in which
biologists describe locations of gene products: (1) relative to
cellular structures (e.g., cytoplasmic side of plasma membrane) or
compartments (e.g., mitochondrion), and (2) the stable
macromolecular complexes of which they are parts (e.g., the
ribosome). Unlike the other aspects of GO, cellular component
concepts refer not to processes but rather to cellular anatomy.
Nevertheless, they are designed to be applied to the actions of gene
products and complexes: a GO annotation to a cellular component
provides information about where a molecular process may
occur during a larger process.


2.6 Biological Processes Define Biological Programs Comprised of Regulated Molecular Processes


In the GO, a biological process represents a specific objective that
the organism is genetically “programmed” to achieve. Each bio- 
logical process is often described by its outcome or ending state, 
e.g., the biological process of cell division results in the creation 
of two daughter cells (a divided cell) from a single parent cell. A 
biological process is accomplished by a particular set of molecular 
processes carried out by specific gene products, often in a highly 
regulated manner and in a particular temporal sequence. 

An annotation of a particular gene product to a GO biological
process concept should therefore have a clear interpretation: the
gene product carries out a molecular process that plays an integral
role in that biological program. But a gene product can affect a
biological objective even if it does not act strictly within the process,
and in these cases a GO annotation aims to specify that relationship 
insofar as it is known. First, a gene product can control when and 
where the program is executed; that is, it might regulate the pro- 
gram. In this case, the gene product acts outside of the program, 
and controls (directly or indirectly) the activity of one or more gene 
products that act within the program. Second, the gene product 
might act in another, separate biological program that is required for 
the given program to occur. For instance, animal embryogenesis 
requires translation, though translation would not generally be con- 
sidered to be part of the embryogenesis program. Thus, currently a 
given biological process annotation could have any of these three 
meanings (namely a gene activity could be part of, regulate, or be 
upstream of but still necessary for, a biological process). The GO 
Consortium is currently exploring ways to computationally repre- 
sent these different meanings so they can be distinguished. 
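
One possible way to make the three meanings explicit is sketched below in Python. This is not the GO Consortium's representation; the names `ProcessRelation`, `ProcessAnnotation`, and `describe` are hypothetical, and the gene product name is a placeholder built on the translation and embryogenesis example above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ProcessRelation(Enum):
    """The three meanings a biological process annotation can currently have."""
    PART_OF = "acts within the program"
    REGULATES = "controls when or where the program is executed"
    UPSTREAM_REQUIRED = "acts in a separate program that the program requires"


@dataclass(frozen=True)
class ProcessAnnotation:
    gene_product: str
    biological_process: str
    # None models the current situation, where the meaning is not recorded.
    relation: Optional[ProcessRelation] = None


def describe(annotation):
    """Spell out what the annotation asserts, if the relation is known."""
    if annotation.relation is None:
        meaning = "relationship unspecified"
    else:
        meaning = annotation.relation.value
    return f"{annotation.gene_product} / {annotation.biological_process}: {meaning}"


if __name__ == "__main__":
    # Placeholder gene product: translation is required for embryogenesis
    # but would not generally be considered part of that program.
    example = ProcessAnnotation(
        "hypothetical ribosomal protein",
        "embryo development",
        ProcessRelation.UPSTREAM_REQUIRED,
    )
    print(describe(example))
    print(describe(ProcessAnnotation("hypothetical gene product", "cell division")))
```

Keeping the relation optional mirrors the state of affairs described above: existing annotations remain valid, and the more precise meaning is added only where it is known.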

Biological process is the largest of the three ontology aspects in 
the GO, and also the most diverse. This reflects the multiplicity of 
levels of biological organization at which genetically encoded pro- 
grams can be identified. Biological process concepts span the entire 
range of how biologists characterize biological systems. They range
from a generic enzymatic process, e.g., protein phosphorylation,
to molecular pathways such as glycolysis or the canonical
Wnt signaling pathway, to complex programs like embryo
development or learning, and even to reproduction, the
ultimate function of every evolutionarily retained gene.

Because of this diversity, in practice not all biological process 
classes actually represent coherent, regulated biological programs. 
In particular, GO biological process also includes molecular-level
processes that cannot always be distinguished from molecular func- 
tions. Taking the previous example, the process class protein phos- 
phorylation overlaps in meaning with the molecular activity class 
protein kinase activity, as protein kinase activity is the enzymatic 
activity by which protein phosphorylation occurs. The main difference 
is that while a molecular function annotation has a precise semantics 
(e.g., the gene carries out protein kinase activity), the biological pro- 
cess annotation does not (e.g., the gene either carries out, regulates, or 
is upstream of but necessary for a particular protein kinase activity). 


3 How Does the GO Relate to the Debate About the Meaning 
of Biological Function? 


GO concepts are designed to describe aspects (molecular activity, 
location of the activity, and larger biological programs) of the func- 
tions that a gene evolved to perform, i.e., selected effect functions. 
However, GO concepts may not always be applied that way. As a 
result, a given GO annotation may or may not be a statement about 
selected effect function. Note that while all biological programs are 
carried out by molecular activities, not all molecular activities nec- 
essarily contribute to a biological program. In principle, then, only 
those GO annotations that refer to biological programs can be 
considered to generally reflect selected effect functions. 

A GO molecular function annotation by itself cannot be 
automatically interpreted as selected effect function. One of the 
most vigorous long-standing debates in the GO Consortium con- 
cerns the protein binding class in GO, as it is clearly appreciated 
by biologists that a given experimental observation of molecular 
binding may reflect biological noise and not necessarily contribu- 
tion to a biological objective. Even further removed, cellular com- 
ponent annotations are often made from observations of a protein 
in a particular compartment, irrespective of whether the protein 
performs a molecular activity in that location. For example, many 
proteins known to act extracellularly are also observed in the Golgi 
apparatus as they await trafficking to the plasma membrane. In 
short, if the molecular activity and cellular location are not yet 
implicated in a biological program (that is itself clearly related to 
survival and reproduction), they cannot be said to have selected 
effect function. Strictly speaking, such annotations should be con- 
sidered as referring to candidate functions, rather than proper 
functions. 

Despite these theoretical considerations, most GO annotations 
are likely in practice to refer to selected effect functions. This is 
simply because most GO annotations are made from publications 
describing specific, small-scale molecular biology studies that focus 
on a particular biological program. In such studies, a biological 
objective (usually implicitly related to survival and reproduction) 
has already been established in advance, and the paper describes the
mechanistic activities of gene products in accomplishing that
biological objective. Annotations from large-scale studies, on the
other hand, which measure gene product activities or locations
without reference to the biological program they are part of, should
be considered as referring to candidate selected effect functions.
This view would address the recent debate about gene function
[10-12], initiated when the ENCODE (Encyclopedia of DNA Elements)
project, a large-scale, hypothesis-free effort to catalog
biochemical activities across numerous regions of the human genome
[13], inappropriately claimed to have discovered proper functions.
The GO Consortium is discussing ways to help users distinguish
hypothesis-driven annotations (likely proper functions) from
large-scale annotations (candidate functions).


It has not generally been appreciated that the Gene Ontology con- 
cepts for describing aspects of gene function assume a specific 
model of how gene products act to achieve biological objectives. 
My aim here has been to describe this model, which, I hope, will 
clarify how GO annotations should be properly used and interpreted,
as well as how the GO relates to biological function as
discussed in both the philosophical and biological literature. 


Chapter 3


Primer on the Gene Ontology 


Pascale Gaudet, Nives Skunca, James C. Hu, and Christophe Dessimoz 


Abstract 


The Gene Ontology (GO) project is the largest resource for cataloguing gene function. The combination 
of solid conceptual underpinnings and a practical set of features has made the GO a widely adopted
resource in the research community and an essential resource for data analysis. In this chapter, we provide 
a concise primer for all users of the GO. We briefly introduce the structure of the ontology and explain 
how to interpret annotations associated with the GO. 


Key words Gene Ontology structure, Evidence codes, Annotations, Gene association file (GAF), GO 
files, Function, Vocabulary, Annotation evidence 


1 Introduction 


The key motivation behind the Gene Ontology (GO) was the 
observation that similar genes often have conserved functions in 
different organisms [1]. Clearly, a common vocabulary was needed 
to be able to compare the roles of orthologous genes (and their 
products) across different species. The value of comparative studies 
of biological function across systems predates Jacques Monod's 
statement that “anything found to be true of E. coli must also be 
true of elephants” [2]. The Gene Ontology aims to produce a
rigorous shared vocabulary to describe the roles of genes across
different organisms [1]. The GO project consists of the Gene Ontology
itself, which models biological aspects in a structured way, and 
annotations, which associate genes or gene products with terms 
from the Gene Ontology. Combining information from all organ- 
isms in one central repository makes it possible to integrate knowl- 
edge from different databases, to infer the functionality of newly 
discovered genes, and to gain insight into the conservation and 
divergence of biological subsystems. 

In this primer, we review the fundamentals of the GO project.
The chapter is organised as answers to five essential questions:
What is the GO? Why use it? Who develops it and provides
annotations? What are the elements of a GO annotation? And
finally, how can the reader learn more about GO resources?


Christophe Dessimoz and Nives Skunca (eds.), The Gene Ontology Handbook, Methods in Molecular Biology, vol. 1446,
DOI 10.1007/978-1-4939-3743-1_3, © The Author(s) 2017


2 What Is the Gene Ontology? 


The Gene Ontology is a controlled vocabulary of terms to repre- 
sent biology in a structured way. The terms are subdivided into 
three distinct ontologies that represent different biological aspects: 
Molecular Function (MF), Biological Process (BP), and Cellular 
Component (CC) [1]. These ontologies are non-redundant and 
share a common space of identifiers and a well-specified syntax. 

Terms are linked to each other by relations to form a hierarchical 
vocabulary (Chap. 1 [3]). This is often modelled as a graph in which 
the relationships form the directed edges, and the terms are the 
nodes (Fig. 1). Since each term can have multiple relationships to 
broader parent terms and to more specific child terms, the structure 
allows for more expressivity than a simple hierarchy. 


[Figure 1: graph of GO terms leading from “regulation of cell projection assembly” (GO:0060491) up to the root term biological_process (GO:0008150); the legend distinguishes regulates and part of edges]

Fig. 1 The structure of the Gene Ontology (GO) is illustrated on a subset of the paths of the term “regulation of
cell projection assembly”, GO:0060491, to its root term. The GO is a directed graph with terms as nodes and
relationships as edges; these relationships are either is_a, part_of, has_part, or regulates. In its basic
representation, there should be no cycles in this graph, and we can therefore establish parent (more general) and
child (more specific) terms (see Chap. 11 [4] for more details on the different representations). Note that it is
possible for a term to have multiple parents. This figure is based on the visualisation available from the AmiGO
browser, generated on November 6, 2015 [5]
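
To illustrate the idea of terms as nodes and typed relationships as edges, the following Python sketch stores a small graph as a dictionary and collects the ancestors of a term. The term identifiers appear in Fig. 1, but the individual edges listed here are assumptions chosen for illustration, and the names `PARENTS` and `ancestors` are hypothetical rather than part of any GO tool.

```python
# child term -> list of (relation, parent term)
# IDs are those shown in Fig. 1; the edges are ASSUMED for illustration.
PARENTS = {
    "GO:0060491": [("is_a", "GO:0031344"), ("regulates", "GO:0030031")],
    "GO:0031344": [("is_a", "GO:0051128")],
    "GO:0051128": [("is_a", "GO:0050794")],
    "GO:0050794": [("is_a", "GO:0050789")],
    "GO:0050789": [("is_a", "GO:0065007")],
    "GO:0065007": [("is_a", "GO:0008150")],
    "GO:0030031": [("is_a", "GO:0030030"), ("is_a", "GO:0022607")],  # two parents
    "GO:0030030": [("is_a", "GO:0016043")],
    "GO:0022607": [("is_a", "GO:0016043")],
    "GO:0016043": [("is_a", "GO:0071840")],
    "GO:0071840": [("is_a", "GO:0008150")],
    "GO:0008150": [],  # root term: biological_process
}


def ancestors(term, relations=("is_a", "part_of")):
    """Return all terms reachable from `term` via the given relation types."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in PARENTS.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("GO:0030031")))
    print(sorted(ancestors("GO:0060491", relations=("is_a", "regulates"))))
```

Restricting the traversal to chosen relation types matters in practice: following is_a and part_of edges yields more general terms, whereas following regulates edges leads to the processes being regulated, which is a different kind of statement.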

The full GO is large: in October 2015, the full ontology
specification had 43835 terms, 73776 explicitly encoded is_a relationships,
7436 explicitly encoded part_of relationships, and 8263 explicitly 
encoded regulates, negatively_regulates, or positively_regulates rela- 
tionships. This level of detail is not necessary for all applications. 
Many research groups who do GO annotations for specific projects 
use the generic GO-slim file, which is a manually curated subset of 
the Gene Ontology containing general, high-level terms across all 
biological aspects. There are several GO slims,¹ ranging from the
general Generic GO slim developed by the GO Consortium to more
specific ones, such as the ChEMBL Drug Target slim.²

To keep up with the current state of knowledge, as well as to 
correct inaccuracies, the GO undergoes frequent revisions: changes 
of relationships between terms, addition of new terms, or term 
removal (obsoletion). Terms are never deleted from the ontology, 
but their status changes to obsolete and all relationships to the 
term are removed [6]. Furthermore, the name itself is preceded by 
the word “obsolete” and the rationale for the obsoletion is typi- 
cally found in the Comment field of the term. An example of an 
obsolete term is GO:0000005, “obsolete ribosomal chaperone 
activity”. This MF GO term was made obsolete “because it refers 
to a class of gene products and a biological process rather than a 
molecular function”.³ Changes to the relationships do not impact
annotations, because annotations are associated with a given GO 
term regardless of its relationships to other terms within the GO. 
Obsoletion of terms, however, has an impact on annotations
associated with them: in some cases, the old term can be automatically
replaced by a new or a parent one; in others, the change is so 
important that the annotations must be manually reviewed. 
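
The two outcomes for annotations to an obsolete term can be sketched as follows. This is a minimal Python illustration under stated assumptions: the record layout, the function name `update_annotations`, the gene names, and the identifiers beginning with `GO:X` are all hypothetical, and GO:0000005 is simply assumed here to have no automatic replacement. The keys `is_obsolete` and `replaced_by` are modelled on the tags of the same name in the OBO file format.

```python
# Hypothetical term records; only GO:0000005 is a real identifier, and its
# lack of a replacement is an ASSUMPTION made for this example.
TERMS = {
    "GO:0000005": {"is_obsolete": True, "replaced_by": None},
    "GO:X000001": {"is_obsolete": True, "replaced_by": "GO:X000002"},
    "GO:X000002": {"is_obsolete": False, "replaced_by": None},
}


def update_annotations(annotations, terms):
    """Split (gene, term) pairs into usable pairs and pairs needing review.

    Annotations to current terms are kept, annotations to obsolete terms
    with a recorded replacement are remapped, and the rest are flagged
    for manual review.
    """
    kept, needs_review = [], []
    for gene, term_id in annotations:
        term = terms[term_id]
        if not term["is_obsolete"]:
            kept.append((gene, term_id))
        elif term["replaced_by"]:
            kept.append((gene, term["replaced_by"]))
        else:
            needs_review.append((gene, term_id))
    return kept, needs_review


if __name__ == "__main__":
    annotations = [
        ("geneA", "GO:X000002"),  # current term: kept as is
        ("geneB", "GO:X000001"),  # obsolete with replacement: remapped
        ("geneC", "GO:0000005"),  # obsolete without replacement: review
    ]
    kept, needs_review = update_annotations(annotations, TERMS)
    print(kept)
    print(needs_review)
```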

However, these changes can affect the analyses done using the 
ontology. In articles or reports, it is good practice to provide the 
version of the file used for a particular analysis. In GO, the version 
number is the date the file was obtained from the GO site (GO files 
are updated daily). 


¹ http://geneontology.org/page/go-slim-and-subset-guide
² http://wwwdev.ebi.ac.uk/chembl/target/browser
³ https://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0000005


3 Why Use the Gene Ontology?


Because it provides a standardised vocabulary for describing gene
and gene product functions and locations, the GO can be used to
query a database in search of genes’ function or location within the
cell or to search for genes that share characteristics [7]. The
hierarchical structure of the GO makes it possible to compare
proteins annotated to different terms in the ontology, as long as
the terms have relationships to each other. Terms located close
together in the ontology graph (i.e., with few intermediate terms
between them) tend to be semantically more similar than those
further apart (see Chap. 12 on comparing terms [8]).
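
As a rough illustration of "closeness" in the graph, the sketch below counts the edges on the shortest path between two terms, ignoring edge direction. It is a deliberately crude proxy and not one of the similarity measures covered in Chap. 12. The term identifiers are taken from Fig. 1, while the is_a edges and the names `IS_A`, `neighbours`, and `edge_distance` are assumptions made for this example.

```python
from collections import deque

# child term -> is_a parents (edges ASSUMED for illustration)
IS_A = {
    "GO:0030031": ["GO:0030030", "GO:0022607"],  # cell projection assembly
    "GO:0030030": ["GO:0016043"],                # cell projection organization
    "GO:0022607": ["GO:0016043"],                # cellular component assembly
    "GO:0016043": ["GO:0071840"],                # cellular component organization
    "GO:0071840": ["GO:0008150"],
    "GO:0065007": ["GO:0008150"],                # biological regulation
}


def neighbours(term):
    """Parents and children of a term, treating edges as undirected."""
    linked = set(IS_A.get(term, []))
    linked.update(child for child, parents in IS_A.items() if term in parents)
    return linked


def edge_distance(a, b):
    """Breadth-first search for the number of edges between two terms."""
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        term, steps = queue.popleft()
        if term == b:
            return steps
        for nxt in neighbours(term):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    return None  # the terms are not connected in this toy graph


if __name__ == "__main__":
    print(edge_distance("GO:0030030", "GO:0022607"))
    print(edge_distance("GO:0030031", "GO:0065007"))
```

In this toy graph the two sibling-like terms are fewer edges apart than cell projection assembly and biological regulation, which matches the intuition in the text; edge counts alone, however, ignore how specific the terms are.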

The GO is frequently used to analyse the results of high- 
throughput experiments. One common use is to infer common- 
alities in the location or function of genes that are over- or 
under-expressed [6, 9, 10]. In functional profiling, the GO is 
used to determine which processes are different between sets of 
genes. This is done by using a likelihood-ratio test to determine 
if GO terms are represented differently between the two gene 
sets [6]. 
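
A likelihood-ratio comparison of this kind can be sketched for a single GO term as a G-test on a 2x2 table. The following Python code is a minimal illustration and not the implementation of any particular tool: the function name `g_test_2x2` and the example counts are assumptions, the p-value relies on the chi-square approximation with one degree of freedom, and a real analysis over many terms would also need a correction for multiple testing.

```python
import math


def g_test_2x2(a, b, c, d):
    """Likelihood-ratio (G) test for the 2x2 table [[a, b], [c, d]].

    a: genes in set 1 annotated to the term    b: genes in set 1 not annotated
    c: genes in set 2 annotated to the term    d: genes in set 2 not annotated
    Returns (G statistic, approximate p-value); assumes a + b + c + d > 0.
    """
    n = a + b + c + d
    observed = [a, b, c, d]
    # Expected counts under independence: row total * column total / n
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    g = 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    g = max(g, 0.0)  # guard against tiny negative values from rounding
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(g / 2.0))
    return g, p_value


if __name__ == "__main__":
    # Invented counts: 30 of 100 genes annotated in set 1, 10 of 100 in set 2.
    g, p = g_test_2x2(30, 70, 10, 90)
    print(f"G = {g:.2f}, p = {p:.2g}")
```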

Additionally, the GO can be used to infer the function of unan- 
notated genes. Gene predictions with significant similarity to anno- 
tated genes can be assigned one or several of the functions of the 
characterised genes. Other methods such as the presence of specific 
protein domains can also be used to assign GO terms [11, 12]. 
This is discussed in Chap. 5 [13]. 

A wealth of tools (web-based services, stand-alone software,
and programming interfaces) has been developed for applying the
GO to various tasks. Some of these are presented in Chap. 11 [4].

While Gene Ontology resources facilitate powerful inferences 
and analyses, researchers using the GO should familiarise them- 
selves with the structure of the ontology and also with the methods 
and assumptions behind the tools they use to ensure that their 
results are valid. Common pitfalls and remedies are detailed in 
Chap. 14 [14]. 


4 Who Develops the GO and Produces Annotations? 


The GO Consortium consists of a number of large databases 
working together to define standardised ontologies and provide 
annotations to the GO [15]. The groups that constitute the GO 
Consortium include UniProt [16], Mouse Genome Informatics
[17], Saccharomyces Genome Database [18], WormBase [19],
FlyBase [20], dictyBase [21], and TAIR [22]. In addition, several
other groups contribute annotations, such as EcoCyc [23] and
the Functional Gene Annotation group at University College
London [24].⁴ Within each group, biocurators assign annotations
according to their expertise [25]. Further, the GO
Consortium has mechanisms by which members of the broader 
community (see Chap. 7 [26]) can suggest improvements to the 
ontology and annotations. 


⁴ Full list at http://geneontology.org/page/go-consortium-contributors-list
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5 What Are the Elements of a GO Annotation? 


This section describes the different elements composing an
annotation and some important considerations about each of them.
The annotation process from a curator standpoint is discussed in
detail in Chap. 4 [27].

Fundamentally, a GO annotation is the association of a gene
product with a GO term. From its inception, the GO Consortium
has recognised the importance of providing supporting information
alongside this association. For instance, annotations always
include information about the evidence supporting the annotation.

Over time, the GO Consortium standards for storing annotations
have evolved to improve this representation. Annotations are
now stored in one of two formats: GAF (Gene Association File)
and the more recent GPAD (Gene Product Association Data). The
two formats contain the same information, but they differ in how
the data is normalised and represented (discussed in more detail
in Chap. 11 [4]). In this primer, we focus on the former. The
representation of an annotation in GAF format version 2.1 is shown
in Fig. 2. It contains 17 fields (also sometimes referred to as
“columns”), which we describe in this section.


5.1 Annotation Object


The annotation object is the entity associated with a GO term—a 
gene, a protein, a non-protein-coding RNA, a macromolecular 
complex, or another gene product. Seven fields of the GAF file 
specify the annotation object. Each annotation in the GO is associ- 
ated with a database (field 1) and a database accession number 
(field 2) that together provide a unique identifier for the gene, the 
gene product, or the complex. For example, the protein record 
P00519 is a database object in the UniProtKB database (Fig. 2). 
The database object symbol (field 3), the database object name 
(field 10), and the database object synonyms (field 11) provide 
additional information about the annotation object. The database 
object type specifies whether the object being annotated is a gene, 
or a gene product (e.g. protein or RNA; field 12). The organism 
from which the annotation object is derived is captured as the 
NCBI taxon ID (taxon; field 13); the corresponding species name 
can be found at the NCBI taxonomy website.

GO allows capturing isoform-specific data when appropriate;
for example, UniProtKB accession numbers P00519-1 and
P00519-2 are the identifiers for isoforms 1 and 2 of P00519.
In this case, the database ID (field 2) still refers to the main
isoform, and the isoform accession is included in the GAF file as
the “Gene Product Form ID” (field 17).


NCBI taxonomy website: http://www.ncbi.nlm.nih.gov/taxonomy

[Figure 2, reconstructed from the figure text. Each entry gives the field number, the example element and cardinality where they are recoverable, and the description.]

1. UniProtKB {1}: Database from which the identifier in column 2 is derived.

2. P00519 {1}: Identifier in the database denoted in column 1.

3. (example not recoverable): Database object symbol; whenever possible, this entry is assigned such that it is interpretable by a biologist.

4. NOT {*}: Flags that modify the interpretation of an annotation. Zero, one, or more of: NOT (negates the annotation), contributes_to (when the gene product is part of a complex), and colocalizes_with (only used for the CC ontology).

5. (example not recoverable): The GO identifier.

6. (example not recoverable) {+}: One or more identifiers for the authority behind the annotation, e.g., PMID, GO Reference Code, or a database reference.

7. (example not recoverable): Evidence code; one of the codes listed in Fig. 3.

8. (example not recoverable): The content depends on the evidence code used and contains more information on the annotation. Different content is possible:
   - GO ID is used in conjunction with evidence code Inferred by Curator (IC) to denote the GO term from which the inference is made.
   - Gene product ID is used in conjunction with evidence codes IEA, IGI, IPI, and ISS. For example, in conjunction with the evidence code Inferred from Sequence Similarity (ISS), it identifies the gene product, similarity to which was the basis for the annotation.

9. (example not recoverable): The ontology or aspect to which the GO term in column 5 belongs. C is Cellular Component, P is Biological Process, and F is Molecular Function.

10. acid phosphatase {?}: Name of the gene or the gene product.

11. YBR092C {*}: Synonym for the identifier denoted in column 2 for the database in column 1.

12. gene {1}: The type of object denoted in column 2, e.g., gene, transcript, protein, or protein structure.

13. taxon:4932 {1,2}: The NCBI ID of the respective organism(s). For single-organism terms, the NCBI taxonomy ID of the respective organism. For multi-organism terms, this column is used either in conjunction with a BP term that is a multi-organism process or a CC term that is a host cell, in which case there are two pipe-separated NCBI taxonomy IDs: the first denotes the organism encoding the gene or the gene product; the second denotes the organism in the interaction.

14. 20010118 {1}: Date on which the annotation was made; note that IEA annotations are re-calculated with every database release.

15. SGD {1}: The database asserting the annotation. Participants in the GO Consortium can make inferences about any organism, so it is not obligatory that field 13 corresponds to field 15.

16. part_of(CL:0000084) {*}: Annotation extension. Cross references to GO or other ontologies that can enhance the annotation.

17. P00519-2 {?}: Gene Product Form ID. This field allows the annotation of specific variants of that gene or gene product.

Fig. 2 Gene Association File (GAF) 2.1 file format described with example elements. In the GAF file, each row
represents an annotation, consisting of up to 17 tab-delimited fields (or columns). This figure describes these
fields in the order in which they are found in the GAF file. Light blue colour denotes non-mandatory fields, and
these are allowed to be empty in the GAF file. The cardinality (the number of elements in the field) is
denoted with the symbol(s) in curly brackets: {?} indicates cardinality of zero or one; {*} indicates that any
cardinality is allowed; {+} indicates cardinality of one or more; {1} indicates that cardinality is exactly one; {1,2}
indicates that cardinality is either one or two. When cardinality is greater than 1, elements in the field are
separated with a pipe character or with a comma; the former indicates “OR” and the latter indicates “AND”. The GO
term assigned in column 5 is always the most specific GO term possible
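
The field layout of the GAF format maps naturally onto a small parser. The following is a minimal sketch, assuming one tab-delimited GAF 2.1 annotation line as input; the field names in `GAF_FIELDS`, the helper `parse_gaf_line`, and the example line are illustrative choices and not an official GO Consortium API.

```python
# Hypothetical names for the 17 GAF 2.1 fields, in file order.
GAF_FIELDS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by", "annotation_extension",
    "gene_product_form_id",
]

# Fields that may hold several pipe-separated elements.
MULTI_VALUED = {
    "qualifier", "reference", "with_from", "db_object_synonym",
    "taxon", "annotation_extension",
}


def parse_gaf_line(line):
    """Split one GAF annotation line into a dict keyed by field name."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) > len(GAF_FIELDS):
        raise ValueError(f"expected at most 17 fields, got {len(columns)}")
    # Pad trailing optional fields that were left out of the line.
    columns += [""] * (len(GAF_FIELDS) - len(columns))
    record = {}
    for name, value in zip(GAF_FIELDS, columns):
        if name in MULTI_VALUED:
            # Pipe means "OR"; comma-joined ("AND") groups are kept intact.
            record[name] = [item for item in value.split("|") if item]
        else:
            record[name] = value
    return record


# Made-up line for illustration only; it is not a real GO annotation.
example = "\t".join([
    "UniProtKB", "P00519", "ABL1", "NOT", "GO:0005886", "PMID:0000000",
    "IDA", "", "C", "", "", "protein", "taxon:9606", "20010118",
    "UniProt", "", "P00519-2",
])
record = parse_gaf_line(example)
print(record["db_object_id"], record["qualifier"], record["go_id"])
```

Header lines in a GAF file start with `!` and would need to be skipped before calling such a function.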


5.2 GO Term, Annotation Extension, and Qualifier


Three fields are used to specify the function of the annotation
object. Field 5 specifies the GO term, while field 9 denotes the
sub-ontology of GO: Molecular Function, Biological Process, or
Cellular Component. While this information is also encoded in the
GO hierarchy, explicitly denoting the sub-ontology simplifies
parsing of the annotations according to the GO aspect. Field 4
denotes the qualifier. One of three qualifiers can modify the
interpretation of an annotation: “contributes_to”,
“colocalizes_with”, and “NOT”. This field is not mandatory,
but if present it can profoundly change the meaning of an
annotation [6]. Thus, while the producers of annotations may
omit qualifiers, applications that consume GO annotations must
take them into account. The importance of qualifiers is discussed
in more detail in Chap. 14 [14].

An additional field, field 16, was added more recently to combine
more than one term or concept (protein, cell type, etc.) in the
same annotation. For example, if a gene product Slpl is localised
to the plasma membrane of T-cells, the GAF file field 16 would
contain the information “part_of(CL:0000084 T cell)”. Here,
CL:0000084 is the identifier for T-cell in the OBO Cell Type (CL)
Ontology. This is covered in detail in Chap. 17 [28] on annotation
extensions.


5.3 Evidence Code and Reference Field


Three fields in the GAF file describe the evidence used to assert the
annotation: the Reference (field 6), the Evidence Code (field 7),
and the With/From (field 8). The Evidence Code indicates the type
of experiment or analysis that supports the annotation. There are
21 evidence codes, which can be grouped into three broad
categories: experimental annotations, curated non-experimental
annotations, and automatically assigned (also known as electronic)
annotations (Fig. 3). The Reference field gives more detail on
the source of the annotation. For example, when the evidence code
denotes an experimentally supported annotation, the Reference
will contain the PubMed accession ID (or a DOI if no PubMed ID
is available) of the journal article which underpins the annotation,
or a GO_REF identifier that refers to a short description of the
assignment method, accessible on the GO website. When the
evidence code denotes an automatically assigned annotation, i.e. IEA,
the reference will contain GO_REF identifiers that specify more
details on the automatic assignment, e.g. annotation via the
InterPro resource [29].


5.3.1 Experimentally Supported Annotations


Annotations based on direct experimental evidence found in the 
primary literature are denoted with the general evidence code EXP 
(Inferred from Experiment) or, when appropriate, the more spe- 
cific evidence codes IDA (Inferred from Direct Assay), IPI (Inferred 
from Physical Interaction), IMP (Inferred from Mutant Phenotype), 
IGI (Inferred from Genetic Interaction), and IEP (Inferred from 
Expression Pattern) (Fig. 3). These annotations are held in high 
regard by the community, e.g. [30], and are often used in applica- 
tions such as checking the enrichment of a gene set in particular 
functions, finding genes that perform a specific function, or assess- 
ing involvement in specific pathways or processes. 
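
Finding the genes that perform a specific function on the basis of experimental evidence can be sketched as a filter on the evidence code (field 7) and the qualifier (field 4). This is a minimal sketch, assuming the annotations are already available as dictionaries; the keys, the function name `genes_with_term`, and the records are hypothetical.

```python
# Experimental evidence codes named in the text above.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def genes_with_term(annotations, go_id):
    """Return genes annotated to go_id by experiment, ignoring negated annotations."""
    genes = set()
    for ann in annotations:
        if ann["go_id"] != go_id:
            continue
        if ann["evidence_code"] not in EXPERIMENTAL_CODES:
            continue
        if "NOT" in ann.get("qualifier", []):
            continue  # a NOT qualifier states that the gene lacks this function
        genes.add(ann["gene"])
    return genes


# Made-up records for illustration; they are not real GO annotations.
annotations = [
    {"gene": "geneA", "go_id": "GO:0005886", "evidence_code": "IDA", "qualifier": []},
    {"gene": "geneB", "go_id": "GO:0005886", "evidence_code": "IEA", "qualifier": []},
    {"gene": "geneC", "go_id": "GO:0005886", "evidence_code": "IMP", "qualifier": ["NOT"]},
]
print(genes_with_term(annotations, "GO:0005886"))
```

Only `geneA` passes both checks here: `geneB` is supported by an electronic annotation only, and `geneC` is explicitly negated.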


Annotation extension format: http://wiki.geneontology.org/index.php/Annotation_Extension#The_basic_format
GO references: http://www.geneontology.org/cgi-bin/references.cgi

[Figure 3, reconstructed from the figure text as a list.]

Experimental annotations (EXP)
- Inferred from Direct Assay (IDA)
- Inferred from Physical Interaction (IPI)
- Inferred from Mutant Phenotype (IMP)
- Inferred from Genetic Interaction (IGI)
- Inferred from Expression Pattern (IEP)

Curated non-experimental annotations
- Inferred from Sequence or Structural Similarity (ISS)
   - Inferred from Sequence Orthology (ISO)
   - Inferred from Sequence Alignment (ISA)
   - Inferred from Sequence Model (ISM)
- Inferred from Genomic Context (IGC)
- Inferred from Phylogenetic Evidence
   - Inferred from Biological aspect of Ancestor (IBA)
   - Inferred from Biological aspect of Descendant (IBD)
   - Inferred from Key Residues (IKR)
   - Inferred from Rapid Divergence (IRD)
- Inferred from Reviewed Computational Analysis (RCA)
- Traceable Author Statement (TAS)
- Non-traceable Author Statement (NAS)
- Inferred by Curator (IC)
- No biological Data available (ND)

Automatically assigned annotations
- Inferred from Electronic Annotation (IEA)

Fig. 3 GO Evidence Codes and their abbreviations. The type of information supporting annotations is recorded
with Evidence Codes, which can be grouped into three main categories: experimental evidence codes, curated
non-experimental annotations, and automatically assigned annotations. The obsolete evidence code NR (Not
Recorded) is not included in the figure. Documentation about the different types of automatically assigned
annotations can be found at http://www.geneontology.org/doc/GO.references


Another important use of experimentally supported annotations
is in providing trustworthy training sets for various computational
methods that infer function [31]. Used this way, the experimentally
supported annotations can be amplified to understand more of the
growing set of newly sequenced genes.


5.3.2 Curated Non-experimental Annotations


Fourteen of the 21 evidence codes are associated with manually 
curated non-experimental annotations. Annotations associated 
with these codes are curated in the sense that every annotation is 
reviewed by a curator, but they are non-experimental in the sense 
that there is no direct experimental evidence in the primary litera- 
ture underpinning them; instead, they are inferred by curators 
based on different kinds of analyses. 
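
The grouping of the 21 evidence codes into three categories can be written down as a lookup table, which is convenient when summarising an annotation set. This is a minimal sketch; the dictionary `EVIDENCE_CATEGORIES`, the function names, and the list of codes being tallied are hypothetical.

```python
from collections import Counter

# The three categories of evidence codes described in the text.
EVIDENCE_CATEGORIES = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "curated non-experimental": {
        "ISS", "ISO", "ISA", "ISM", "IGC", "IBA", "IBD", "IKR", "IRD",
        "RCA", "TAS", "NAS", "IC", "ND",
    },
    "automatically assigned": {"IEA"},
}


def evidence_category(code):
    """Return the category of an evidence code, or raise for an unknown code."""
    for category, codes in EVIDENCE_CATEGORIES.items():
        if code in codes:
            return category
    raise KeyError(f"unknown evidence code: {code}")


def tally_categories(codes):
    """Count how many annotations fall into each evidence category."""
    return Counter(evidence_category(code) for code in codes)


# Made-up list of evidence codes, one per annotation.
print(tally_categories(["IEA", "IEA", "IDA", "ISS", "IEA", "TAS"]))
```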

ISS (Inferred from Sequence or Structural Similarity) is a 
superclass (i.e. a parent) of ISA (Inferred from Sequence 
Alignment), ISO (Inferred from Sequence Orthology), and ISM 
(Inferred from Sequence Model) evidence codes. Each of the three 
subcategories of ISS should be used when only one method was 
used to make the inference. For example, to improve the accuracy 
of function propagation by sequence similarity, many methods take 
into account the evolutionary relationships among genes. Most of 
these methods rely on orthology (ISO evidence code), because the 
function of orthologs tends to be more conserved across species 
than paralogs [32, 33]. In a typical analysis, characterised and 
uncharacterised genes are clustered based on sequence similarity 
measures and phylogenetic relationships. The function of unknown 
genes is then inferred from the function of characterised genes 
within the same cluster (e.g. [34, 35]). 
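
The last step of such an analysis, transferring functions within a cluster, can be sketched as follows. This is a deliberately simple illustration and not the method of the cited references: it assumes the clusters have already been computed, and it transfers the union of all GO terms found in a cluster, which is only one of several possible transfer rules. All names and identifiers are hypothetical.

```python
def transfer_within_clusters(clusters, known):
    """Assign to each uncharacterised gene the GO terms of the
    characterised genes that share its cluster.

    clusters: iterable of sets of gene identifiers
    known:    dict mapping characterised genes to sets of GO terms
    """
    inferred = {}
    for members in clusters:
        # Collect every term carried by a characterised member.
        cluster_terms = set()
        for gene in members:
            cluster_terms |= known.get(gene, set())
        # Propagate to members that have no annotation of their own.
        for gene in members:
            if gene not in known and cluster_terms:
                inferred[gene] = set(cluster_terms)
    return inferred


# Made-up clusters and annotations for illustration.
clusters = [{"g1", "g2", "g3"}, {"g4"}]
known = {"g1": {"GO:0005886"}}
print(transfer_within_clusters(clusters, known))
```

Here `g2` and `g3` inherit the term of `g1`, while `g4` stays unannotated because its cluster contains no characterised gene.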

Another approach to function prediction entails supervised
machine learning based on features derived from protein sequence
[36-39] (ISM evidence code). Such an approach uses a training set of
classified sequences to learn features that can be used to infer gene
functions. Although few explicit assumptions about the complex
relationship between protein sequence and function are required,
the results are dependent on the accuracy and completeness of the
training data.

IGC (Inferred from Genomic Context) includes, but is not 
limited to, such things as identity of the genes neighbouring the 
gene product in question (i.e. synteny), operon structure, and phy- 
logenetic or other whole-genome analysis. 

Relatively new are four evidence codes associated with phylo- 
genetic analyses. IBA (Inferred from Biological aspect of Ancestor) 
and IBD (Inferred from Biological aspect of Descendant) indicate 
annotations that are propagated along a gene tree. Note that the 
latter is only applicable to ancestral genes. The loss of an active site, 
a binding site, or a domain critical for a particular function can be 
annotated using the IKR (Inferred from Key Residues) evidence 
code. When this code is assigned by PAINT, GO’s Phylogenetic 
Annotation and INference Tool [40], this means that it is a predic- 
tion based on evolutionary neighbours. Finally, negative annota- 
tions can be assigned to highly divergent sequences using the code 
IRD (Inferred from Rapid Divergence). 

RCA (Inferred from Reviewed Computational Analysis) captures
annotations derived from predictions based on computational
analyses of large-scale experimental data sets, or based on
computational analyses that integrate datasets of several types, 
including experimental data (e.g. expression data, protein-protein 
interaction data, genetic interaction data), sequence data (e.g. pro- 
moter sequence, sequence-based structural predictions), or math- 
ematical models. 

Next, there are two types of annotations derived from author
statements. Traceable Author Statement (TAS) refers to papers
where the result is cited, but not the original evidence itself, such
as review papers. In contrast, Non-traceable Author Statement
(NAS) refers to a statement in a database entry or to statements
in papers that cannot be traced to another paper.

The final two evidence codes for curated non-experimental 
annotations are IC (Inferred by Curator) and ND (No biological 
Data available). If an assignment of a GO term is made using the 
curator’s expert knowledge, concluding from the context of the 
available data, but without any direct evidence available, the IC 
evidence code is used. For example, if a eukaryotic protein is anno- 
tated with the MF term “DNA ligase activity”, the curator can 
assign the BP term “DNA ligation” and CC term “nucleus” with 
the evidence code IC. 

The ND evidence code indicates that the function is currently
unknown (i.e. that no characterisation of the gene is currently
available). Such an annotation is made to the root of the respective
ontology to indicate which functional aspect is unknown. Hence,
the ND evidence code allows users to distinguish between
unannotated genes (for which the literature has not been completely
reviewed and thus no GO annotation has been made) and
uncharacterised genes (GO annotation with the ND code). Note that
the ND code is also different from an annotation with the “NOT”
qualifier (which indicates the absence of a particular function).


5.3.3 Automatically Assigned Annotations


The evidence code IEA (Inferred from Electronic Annotation) is
used for all inferences made without human supervision, regardless
of the method used. The IEA evidence code is by far the most
abundantly used evidence code. The guiding idea behind computational
function annotation is the notion that genes with similar sequences
or structures are likely to be evolutionarily related, and thus,
assuming that they largely kept their ancestral function, they might
still have similar functional roles today. For an in-depth discussion
of computational methods for GO function annotations, refer to
Chap. 5 or see refs. [13, 41].


5.3.4 Additional Considerations About Evidence Codes


Biases associated with the different evidence codes are discussed in
Chap. 14. Note that there is a more extensive Evidence and
Conclusion Ontology (ECO; [42]), formerly known as the “Evidence
Code Ontology”, presented in Chap. 18 [43]. ECO is only partially
implemented in the GO: ECO terms are displayed in the AmiGO browser,
but they are not in the GAF file. However, all Evidence Codes used
by the GO are also found in ECO. There is a general assumption
among the GO user community that annotations based on experiments
are of higher quality than those generated electronically, but this
has yet to be empirically demonstrated. Generally, annotations
derived from automatic methods tend to be made to high-level
terms, so they may have a lower information value, but they often
withstand scrutiny. Conversely, experiments are sometimes
over-interpreted (see Chap. 4 [27]) and can also contain inaccuracies.


5.4 Uniqueness of GO Annotations (or Lack Thereof)


No two annotations can have the same combination of the follow- 
ing fields: gene/protein ID, GO term, evidence code, reference, 
and isoform. Thus one gene can be annotated to the same term 
with more than one evidence code. 
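
The uniqueness rule above amounts to treating these five fields as a composite key. The following is a minimal sketch of a duplicate check under that reading; the dictionary keys and function names are hypothetical, and the reference is assumed to be stored as a single string.

```python
from collections import Counter


def annotation_key(ann):
    """Build the five-field key that should be unique for each annotation."""
    return (
        ann["db_object_id"],
        ann["go_id"],
        ann["evidence_code"],
        ann["reference"],
        ann["gene_product_form_id"],
    )


def find_duplicates(annotations):
    """Return the keys that occur more than once in the annotation list."""
    counts = Counter(annotation_key(ann) for ann in annotations)
    return [key for key, count in counts.items() if count > 1]


# Made-up records: the same gene and term, supported by two evidence codes.
records = [
    {"db_object_id": "P00519", "go_id": "GO:0005886", "evidence_code": "IDA",
     "reference": "PMID:0000000", "gene_product_form_id": ""},
    {"db_object_id": "P00519", "go_id": "GO:0005886", "evidence_code": "IMP",
     "reference": "PMID:0000000", "gene_product_form_id": ""},
]
print(find_duplicates(records))
```

The two records differ in their evidence code, so they are distinct annotations even though they link the same gene product to the same term.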

Most GO analyses are gene based, and it is therefore important in
such analyses to make sure that the list of genes is non-redundant.
However, annotations are often made to larger protein sets that
include multiple proteins from the same gene. This is particularly
evident in UniProt, which can contain distinct entries from the
TrEMBL (unreviewed) portion of the database that do not necessarily
represent biologically distinct proteins. The different entries for
the same protein or gene are often annotated with identical GO
terms, which can bias statistical analyses because some genes have
many more entries than other genes. For instance, the set of human
proteins in UniProt comprises over 70,000 entries, but there are
only approximately 20,000 recognised human protein-coding genes
(20,187 reviewed human proteins in UniProt release 2015_12). The
GO Consortium has worked with UniProt as well as the Quest for
Orthologs Consortium to develop “gene-centric” reference proteome
lists (http://www.uniprot.org/proteomes/) that provide a single
“canonical” UniProt entry for each protein-coding gene. These lists
are available for many species, and we encourage users performing
gene-centric GO analyses to use only the annotations for UniProt
entries in these lists.
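
Restricting an analysis to a gene-centric list can be sketched as a simple filter. This is a minimal sketch, assuming the reference proteome has already been read into a set of accessions; the function name, the dictionary keys, and the accessions are hypothetical.

```python
def gene_centric_terms(annotations, reference_proteome):
    """Map each reference-proteome accession to the set of GO terms annotated to it.

    Annotations to entries outside the reference proteome are ignored, so
    that every gene is represented by a single canonical entry.
    """
    terms = {}
    for ann in annotations:
        accession = ann["db_object_id"]
        if accession not in reference_proteome:
            continue  # redundant entry for a gene that is already represented
        terms.setdefault(accession, set()).add(ann["go_id"])
    return terms


# Made-up accessions: two entries for the same gene, one of them canonical.
reference_proteome = {"ACC_CANONICAL"}
annotations = [
    {"db_object_id": "ACC_CANONICAL", "go_id": "GO:0005886"},
    {"db_object_id": "ACC_REDUNDANT", "go_id": "GO:0005886"},
]
print(gene_centric_terms(annotations, reference_proteome))
```

After filtering, the term is counted once for the gene rather than once per database entry.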


6 How Can I Learn More About Gene Ontology Resources? 


Most of the topics introduced in this primer will be treated in more 
depth and nuance in later chapters. Part II focuses on the creation 
of GO function annotations—we cover in depth the two main 
strategies of creating GO function annotations: manual extrac- 
tion/curation from the literature and computational prediction. 
Part III describes the main strategies used to evaluate their predic- 
tive performance. Part IV covers practical uses of the GO annota- 
tions: we discuss how GO terms and GO annotations can be 
summed and compared, how enrichment in specific GO terms can 
be analysed, and how the GO annotations can be visualised. For 
the advanced GO user, Part V discusses how the context of a GO 
annotation is recorded and goes beyond the Evidence Codes to 
describe how to capture more information on the source of an 
annotation. We end with Part VI by going beyond GO: we present 
alternatives to GO for functional annotation; we show how a struc- 
tured vocabulary is used in the context of controlled clinical termi- 
nologies; and we present how information from different structured 
vocabularies is integrated in one overarching resource. 


Part II


Making Gene Ontology Annotations 


Chapter 4 


Best Practices in Manual Annotation with the Gene 
Ontology 


Sylvain Poux and Pascale Gaudet 


Abstract 


The Gene Ontology (GO) is a framework designed to represent biological knowledge about gene prod- 
ucts’ biological roles and the cellular location in which they act. Biocuration is a complex process: the body 
of scientific literature is large and selection of appropriate GO terms can be challenging. Both these issues 
are compounded by the fact that our understanding of biology is still incomplete; hence it is important to 
appreciate that GO is inherently an evolving model. In this chapter, we describe how biocurators create 
GO annotations from experimental findings from research articles. We describe the current best practices 
for high-quality literature curation and how GO curators succeed in modeling biology using a relatively 
simple framework. We also highlight a number of difficulties when translating experimental assays into GO 
annotations. 


Key words Gene ontology, Expert curation, Biocuration, Protein annotation 


1 Background 


Biological databases have become an integral part of the tools 
researchers use on a daily basis for their work. GO is a controlled 
vocabulary for the description of biological function, and is used 
to annotate genes in a large number of genome and protein data- 
bases. Its computable structure makes it one of the most widely 
used resources. Manual annotation with GO involves biocurators,
who are trained to read, extract, and translate experimental
findings from publications into GO terms. Since both the
scientific literature and the GO are complex, novice biocurators 
can make errors or misinterpretations when doing annotation. 
Here, we present guidelines and recommendations for best
practices in manual annotation, to help curators avoid the most
common pitfalls. These recommendations should be useful not only
to biocurators, but also to users of the GO, since understanding
the curation process helps in interpreting the meaning of the
annotations.


1.1 Knowledge Inference: General Principles


Our understanding of the world is built by observation and exper- 
imentation. The overall process of the scientific method involves 
making hypotheses, deriving predictions from them, and then 
carrying out experiments to test the validity of these predictions. 
The results of the experiments are then used to infer whether the 
prediction was true or not [1]. Hypotheses are tested, validated, 
or rejected, and the combination of all the experiments contrib- 
utes to uncovering the mechanism underlying the process being 
studied (Fig. 1). 

Examples of experiments include testing an enzymatic activity 
in vitro using purified reagents, measuring the expression level of 
a protein upon a given stimulus, or observing the phenotypes of 
an organism in which a gene has been deleted by molecular genet- 
ics techniques. Different inferences can be made from the same 
experimental setup depending on the hypothesis being tested. 
Thus, the conclusions that can be derived from individual experi- 
ments may vary, depending on a number of factors: they depend 
on the current state of knowledge, on how well controlled the 
experiment is, on the experimental conditions, etc. It also hap- 
pens that the conclusions from a low-resolution experiment are 
partially or completely refuted when better techniques become 
available. These factors are inherent to empirical studies and 
must be taken into account to ensure correct interpretation of
experimental results.


[Figure 1: flow diagram; recoverable labels: Experimental result, Hypothesis rejected, Hypothesis confirmed]

Fig. 1 How the scientific method is used to test and validate hypotheses


1.2 Knowledge Representation Using Ontologies


1.3 Methods for Assigning GO Annotations


GO is a framework to describe the roles of gene products across all
living organisms [2] (see also Chap. 2, [3]). The ontology is divided 
into three branches, or aspects: Molecular Function (MF) that cap- 
tures the biochemical or molecular activity of the gene product; 
Biological Process (BP), corresponding to the wider biological 
module in which the gene product's MF acts; and Cellular 
Component (CC), which is the specific cellular localization in 
which the gene product is active. 

The association of a GO term and a gene product is not explicitly
defined, but implicitly means that the gene product has an
activity or a molecular role (MF term), directly participates in a
process (BP), and that the function takes place in a specific cellular
localization (CC) [2]. Therefore, transient localizations such as
endoplasmic reticulum and Golgi apparatus for secreted proteins
are not in the scope of GO. Biological process is the most
challenging aspect of the GO to capture, in part because it models
two categories of relationships. The first is subtypes: “mitotic DNA
replication” (GO:1902969) is a particular type of “nuclear DNA
replication” (GO:0033260). The second is sub-processes: mitotic
DNA replication is a step of the “mitotic cell cycle” (GO:0000278).
These two classification axes are distinguished by “is a” and
“part of” relations with their parents, respectively. A gene product
can be annotated using as many GO terms as necessary to completely
describe its function, and the GO terms can be at varying levels in
the hierarchy, depending on the evidence available. If a gene
product is annotated to any particular term, then the annotation
also holds for all the is-a and part-of parent terms. Annotations to
more granular terms carry more information; however, the
annotation cannot be any deeper than what is supported by the
evidence.
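
The rule that an annotation also holds for all is-a and part-of parents can be sketched as a walk up the ontology graph. This is a minimal sketch on a toy graph: the three terms and two relations come from the example in the text and are treated here as direct links for illustration, whereas a real analysis would load the complete ontology. The names `PARENTS`, `ancestors`, and `propagate`, and the gene identifier, are hypothetical.

```python
# Toy parent links: term -> list of (relation, parent term).
PARENTS = {
    "GO:1902969": [("is_a", "GO:0033260"), ("part_of", "GO:0000278")],
    "GO:0033260": [],
    "GO:0000278": [],
}


def ancestors(term, parents):
    """Return all terms reachable from term through is_a and part_of links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct, parents):
    """Extend each gene's direct annotations with all implied parent terms."""
    full = {}
    for gene, terms in direct.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full


# Made-up direct annotation of one gene to the most specific term.
print(propagate({"geneX": {"GO:1902969"}}, PARENTS))
```

A gene annotated only to GO:1902969 is thereby also counted for GO:0033260 and GO:0000278, which matters when annotations are compared or tallied at different levels of the hierarchy.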

The complexity of biology is reflected in the GO: with 40,000 
different terms [4], learning to use the GO can be compared to 
learning a new language. As when learning a language, there are 
terms that are closely related to those we are familiar with, and
others that have subtle but important differences in meaning. The GO
defines each term in two complementary ways: first, by a textual
definition intended to be human readable; second, by the structure
of the ontology, that is, by the relationships of terms to each
other, which can be utilized for computational reasoning.


There are two general methods for assigning GO terms to gene 
products. The first is based on experimental evidence, and involves 
detailed reading of scientific publications to capture knowledge 
about gene products. Biocurators browse the GO ontologies to 
associate appropriate GO term(s) whose definition is consistent 
with the data published for the gene product. See Chaps. 3 [5] and 
17 [6], for a description of the elements of an annotation. Expert 
curation based on experiments is considered the gold standard of 
functional annotation. It is the most reliable and provides strong
support for the association of a GO term with a gene product. 

The second method involves making predictions on the pro- 
tein’s function and subcellular localization, most often with meth- 
ods relying on sequence similarity. Although not detailed in this 
chapter, prediction methods are highly dependent on annotations 
based on experiments. Indeed, all methods to assign annotations 
based on sequence similarity are more or less directly derived from 
knowledge that has been acquired experimentally; that is, at least 
one related protein must have been tested and shown to have a 
given function for that information to be propagated to other pro- 
teins. Hence, the accurate assignment of GO classes to gene prod- 
ucts based on experimental results is crucial, since many further 
annotations depend on their accuracy. 


2 Best Practices for High-Quality Manual Curation 


2.1 GO Inference 
Process 


Similar to the process by which experimental results get translated
into a model of the biological phenomenon being investigated,
biocurators take the conclusions from the investigation and convert
them into the GO framework. Thus, the same assay may lead to
different interpretations depending on the question being tested.

As shown in Table 1, an assay must be interpreted in the wider 
context of the known roles of the protein, and how directly the 
assay assesses the protein’s role in the process under investigation. 
Here, several experiments are described in which the readout is 
DNA fragmentation upon apoptotic stimulation, but that lead to 
different annotations. DFFB (UniProtKB O76075) is annotated to
“apoptotic DNA fragmentation” (GO:0006309) because the pro- 
tein is also known to be a nuclease. CYCS (UniProtKB P99999) is 
annotated to caspase activation (“activation of cysteine-type endo- 
peptidase activity involved in apoptotic process” (GO:0006919)) 
because a direct role has been shown using an in vitro assay. 
However CYCS is not annotated to “apoptotic DNA fragmenta- 
tion” (GO:0006309) despite the observation that removing it 
from cells prevents DNA fragmentation, since the activity of CYCS 
occurs before DNA fragmentation. Any step that takes place after- 
wards will inevitably fail to happen, but this does not imply partici- 
pation in this downstream sequence of molecular events. Finally, 
the FOXL2 (UniProtKB P58012) transcription factor has a posi- 
tive effect on the occurrence of apoptosis, by an unknown mecha- 
nism, so it is annotated to “positive regulation of apoptotic process” 
(GO:0043065). This is where the curator’s knowledge is critical 
and provides the most added value over, e.g., machine learning and
text mining.


Table 1
GO inference process, from the hypothesis in the paper to the assay and result, and to the inference
of a GO function or role

Protein: DFFB (O76075)
Known roles: DNase
Hypothesis: The nuclease activity of DFFB is required for nuclear DNA fragmentation during apoptosis
Assay and result: Apoptotic DNA fragmentation; increased in the presence of DFFB
Conclusion and GO term: DFFB mediates nuclear DNA fragmentation during apoptosis; apoptotic DNA fragmentation (GO:0006309)
Reference: [7]

Protein: CYCS (P99999)
Known roles: Cytochrome C; electron transport
Hypothesis: CYCS triggers the activation of caspase-3
Assay and result: Apoptotic DNA fragmentation; decreased upon immunodepletion of CYCS. Purified CYCS; stimulates the auto-proteolytic activity of caspase-3
Conclusion and GO term: CYCS directly activates caspase-3; activation of cysteine-type endopeptidase activity involved in apoptotic process (GO:0006919)
Reference: [8]

Protein: FOXL2 (P58012)
Known roles: Transcription factor
Hypothesis: Mutations in FOXL2 are known to cause premature ovarian failure, which may be due to increased apoptosis
Assay and result: Apoptotic DNA fragmentation; increased in the presence of FOXL2
Conclusion and GO term: FOXL2 increases the rate of apoptosis; positive regulation of apoptotic process (GO:0043065)
Reference: [9]


2.2 Needles and Haystacks


With more than 500,000 records indexed yearly in PubMed, it is 
not possible for the GO to comprehensively represent all the avail- 
able data on every protein. To address this, a careful prioritization 
of both articles and proteins to annotate is done. The publications 
from which information is drawn are selected to accurately repre- 
sent the current state of knowledge. Accessory findings and non- 
replicated data are not systematically annotated; confirmation or at 
least consistency with findings from several publications is invalu- 
able to accurately describe the function of a gene product. 
Focusing on a topic allows the curator to construct a clear pic- 
ture of the protein’s role and makes it easier to make the best deci- 
sions when capturing biological knowledge as annotations. Reading 
different publications in the field helps to resolve issues and select 
terms with more confidence. Existing GO annotation in proteins 
that participate in the same biological process is also helpful to
decide on how best to represent the experimental data with the GO.
On the other hand, without the broader context of the research
domain, some papers may be misleading: first, as more data accumulate,
a growing number of contradictory or even incorrect
results are found in the scientific literature. Second, the way knowledge
evolves occasionally obsoletes previous findings. Curators use
their expertise to assess the scientific content of articles and avoid
these pitfalls [10].


2.3 How Low Can You Go: Deciding on the Level of Granularity of an Annotation


The level of granularity of an annotation is dictated by the evidence
supporting it. A good illustration is provided by the human ADCK3
protein (UniProtKB Q8NI60), an atypical kinase that contains a protein
kinase domain and is involved in the biosynthesis of ubiquinone,
an essential lipid-soluble electron transporter. Although it contains
a protein kinase domain, it is unclear whether it acts as a protein
kinase that phosphorylates other proteins in the CoQ complex or
acts as a lipid kinase that phosphorylates a prenyl lipid in the
ubiquinone biosynthesis pathway [11]. While it would be tempting to
conclude that the protein has “protein kinase activity”
(GO:0004672) from the presence of the protein kinase domain,
the more general term “kinase activity” (GO:0016301) with no
specification of the potential substrate class (lipid or protein) is
more appropriate.


2.4 Less Is More: Avoiding Over-Interpretation


2.4.1 Biological Relevance of Experiments


Annotations focus on capturing experiments that are biologically
relevant. Thus, substrates, tissue, or cell-type specificity are annotated
only when the data indicate the physiological importance of
these parameters. One difficulty is that it is not always possible to
distinguish between experimental context and biological context,
which can potentially result in GO terms being assigned as if they
represented a specific role or specific conditions, while in
fact this only reflects the experimental setup and does not have real
biological significance. For example, the activity of E3 ubiquitin
protein ligases is commonly tested by an in vitro autoubiquitination
assay. While convenient, the assay is not conclusive with
respect to “protein autoubiquitination” (GO:0051865)
in vivo. In the absence of additional data, only the term “ubiquitin
protein ligase activity” (GO:0061630) should be used. Similarly,
the cell type in which a function was tested does not imply that the
cell type is relevant for the function; any hint that the protein is
studied outside its normal physiological context (such as overexpression)
should be carefully taken into consideration.


2.4.2 Downstream Effects


Downstream effects, as well as readouts (discussed above in
Subheading 2.1), can lead to incorrect annotations if they are
directly assigned to a gene product that acts many steps
earlier. Here we use downstream as “occurring after,” with no
implication that the events follow one another directly.


[Diagram: H2B -> (RNF20, E3 ubiquitin ligase) -> H2BK120ub -> (H3 complex methylases) -> H3K4me, H3K79me]

Fig. 2 Monoubiquitination of histone H2B (H2BK120ub) promotes methylation of
histone H3 (H3K4me and H3K79me)


Gene products that play housekeeping functions or function
upstream of important signaling pathways have many indirect
effects and pose a challenge for annotation. This can be illustrated
by proteins that mediate chromatin modification. Histone tails are
posttranslationally modified by a complex set of interdependent
modifications. For instance, histone H2B monoubiquitination at
Lys-120 (H2BK120ub) is a prerequisite for the methylation of
histone H3 at Lys-4 and Lys-79 (H3K4me and H3K79me, respectively)
(Fig. 2). RNF20 (UniProtKB Q5VTR2), an E3 ubiquitin
ligase that mediates H2BK120ub, therefore indirectly promotes
H3K4me and H3K79me methylation [12]. Thus, the annotation
of enzymes that modify histone tails is limited to the primary
function of the enzyme (“ubiquitin-protein ligase activity”
(GO:0004842) and “histone H2B ubiquitination” (GO:0033523),
in this case), while the further histone modifications are only annotated
to the proteins mediating these modifications.

A similar approach is taken for cases where the experimental
readout is also a GO term. Examples of this include DNA fragmentation
assays to measure apoptosis, and MAPK cascade to measure
the activation of an upstream pathway. Proteins that are involved in
signaling leading to apoptosis do not mediate or participate in
DNA fragmentation, but their addition or removal causes changes
in the amount of DNA fragmentation upon apoptotic stimulation.
In other words, the effect of a protein on a specific readout can be
very indirect. Whenever possible, annotation of these very specific
terms (“apoptotic DNA fragmentation” (GO:0006309), “MAPK
cascade” (GO:0000165)) is limited to cases where there is evidence
of a molecular function supporting a direct implication in the process.
If that information is not available, the annotation is made to
a more general term, such as “apoptotic process” (GO:0006915)
or “intracellular signal transduction” (GO:0035556), for instance.


2.4.3 Phenotypes


One common method to determine the function or process of a 
gene is mutagenesis. However, interpreting the results from mutant 
phenotypes is very difficult, as the effects caused by the absence or 
disruption of a gene can be very indirect. Any kind of knockout/knockdown
or “add back” experiments (in which proteins are
either overexpressed or added to a cellular extract) cannot demonstrate
the participation of a protein in a process, only its requirement
for the process to occur. Inferring a participatory role would
be an over-interpretation of the results. A striking illustration of
this can be made with housekeeping genes, such as those involved
in transcription and translation: knockouts of these genes (when
not lethal) can be pleiotropic and affect essentially all cellular processes.
It would be both inaccurate and overwhelming for curators
to annotate these gene products to every cellular process impacted.
The more prior knowledge we have about a protein’s function, in
particular its biochemical activity, the more accurate we can be
when interpreting a phenotype.

Phenotypes caused by gene mutations are of great interest, not
only to try to understand the function of proteins, but also to provide
insights into mechanisms leading to disease. The scope of the
GO, though, is to capture the normal function of proteins. There
are phenotype ontologies for human (HPO [13]), mouse (MP
[14]), and other species that allow phenotypes to be captured in a
structure that is more relevant to this type of data.


2.5 Main Functions and Secondary Roles


One limitation of the GO is that main functions and secondary 
roles are not explicitly encoded, so that this information is difficult 
to find. For example, enzymes may have different substrates: in 
some cases, the substrate specificity is driven by the biological con- 
text, but in other cases by the experimental conditions. While some 
activities represent the main function of the enzyme, others are 
secondary or can be limited to very specific conditions. 

A good example is provided by the CYP4F2 enzyme
(UniProtKB Q9UIU8), a member of the cytochrome P450 family
that oxidizes a variety of structurally unrelated compounds,
including steroids, fatty acids, and xenobiotics. In vivo, the enzyme
plays a key role in vitamin K catabolism by mediating omega-hydroxylation
of vitamin K1 (phylloquinone) and menaquinone-4
(MK-4), a form of vitamin K2 [15, 16]. While hydroxylation of
phylloquinone and MK-4 probably constitutes the main activity of
this enzyme, since this activity has been confirmed by several in vivo
assays, CYP4F2 also shows omega-hydroxylase activity towards other
related substrates, such as arachidonic acid and leukotriene B4
[17-21]. Clearly vitamin K1 and MK-4 are the main physiological
substrates of CYP4F2, but since it is plausible that the
enzyme also acts on other molecules, these different activities are
also annotated. In the absence of additional evidence, it is currently
impossible to highlight which GO term describes the in vivo
function of the enzyme. For the reactions known to be implicated
in vitamin K catabolism, adding this information as an annotation
extension helps clarify the main role of that specific reaction (see
Chap. 17, [6]).


2.6 Hindsight Is 
20/20: Dealing 
with Evolving 
Knowledge 
Our understanding of biology is dynamic, and evolves as new
experiments confirm or contradict previous results. It is therefore 
essential to read several, preferably recent publications on a subject 
to make sure that prior working hypotheses, that have subsequently 
been invalidated, are not annotated. That is, sometimes it is neces- 
sary to remove annotations in order to limit the number of false 
positives. A number of mechanisms exist in GO to capture evolu- 
tion of knowledge. New GO terms are added to the ontology 
when knowledge is not covered by existing GO terms. Curators 
work in collaboration with the GO editors, defining new terms or 
correcting the definitions of existing terms when required. 
Conflicting results can be dealt with by using the “NOT” qualifier,
which states that a gene product is not associated with a GO term.
This qualifier is used when a positive association to this term could
otherwise be expected from previous literature or automated
methods (for more information read
www.geneontology.org/GO.annotation.conventions.shtml#not).

A good example of how GO deals with evolving knowledge as 
new papers are published on a protein is provided by the recent 
characterization of the NOTUM protein in human and Drosophila 
melanogaster. Notum was first characterized in D. melanogaster 
(UniProtKB Q9VUX3) as an inhibitor of Wnt signaling [22, 23]. 
Based on its sequence similarity with pectin acetylesterase family 
members, it was initially thought to hydrolyze glycosaminoglycan 
(GAG) chains of glypicans by mediating cleavage of their GPI 
anchor in vitro [24]. Two recent articles contradict
these earlier results, showing that the substrate of human
NOTUM (UniProtKB Q6P988) and D. melanogaster Notum is
not glypicans, and that human NOTUM specifically removes a
palmitoleic acid modification from WNT proteins [25, 26]. These new
data confirm the role of NOTUM as an inhibitor of Wnt signaling,
but through a mechanism completely different from the one the initial
studies had suggested. To capture these findings correctly in GO,
new terms describing protein depalmitoleylation were added:
“palmitoleyl hydrolase activity” (GO:1990699) and “protein
depalmitoleylation” (GO:1990697). In addition, NOTUM proteins
received negative annotations for “GPI anchor release”
(GO:0006507) and “phospholipase activity” (GO:0004620) to
indicate that these findings had been disproven.
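
For readers who process annotation files programmatically, the practical consequence of the “NOT” qualifier is that negated records must be separated from positive ones before any analysis. The following minimal sketch assumes tab-separated records laid out like the first five columns of a GAF file (database, object ID, symbol, qualifier, GO ID); the four example records are illustrative and carry only those columns, and the helper names are hypothetical.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    protein: str        # DB Object ID (GAF column 2)
    qualifiers: tuple   # Qualifier field split on "|" (GAF column 4)
    go_id: str          # GO ID (GAF column 5)

    @property
    def is_negated(self) -> bool:
        # A record is a negative annotation if "NOT" is among its qualifiers.
        return "NOT" in self.qualifiers


def parse_record(line: str) -> Annotation:
    """Parse one tab-separated record (hypothetical helper, first 5 columns only)."""
    cols = line.rstrip("\n").split("\t")
    qualifiers = tuple(q for q in cols[3].split("|") if q)
    return Annotation(protein=cols[1], qualifiers=qualifiers, go_id=cols[4])


def split_by_polarity(annotations):
    """Return (positive, negative) lists, preserving input order."""
    annotations = list(annotations)  # allow generators to be passed in
    positive = [a for a in annotations if not a.is_negated]
    negative = [a for a in annotations if a.is_negated]
    return positive, negative


# Illustrative records for human NOTUM, using the GO terms named in the text.
EXAMPLE_RECORDS = [
    "UniProtKB\tQ6P988\tNOTUM\t\tGO:1990699",
    "UniProtKB\tQ6P988\tNOTUM\t\tGO:1990697",
    "UniProtKB\tQ6P988\tNOTUM\tNOT\tGO:0006507",
    "UniProtKB\tQ6P988\tNOTUM\tNOT\tGO:0004620",
]

positive, negative = split_by_polarity(parse_record(r) for r in EXAMPLE_RECORDS)
```

Treating the negated records as ordinary annotations would wrongly associate NOTUM with GPI anchor release and phospholipase activity, which is exactly what the negative annotations are meant to prevent.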

Although relatively infrequent, this type of situation is critical 
because it may affect the accuracy of the GO. Ideally, when new 
findings invalidate previous ones, old annotations are revisited in 
the light of new knowledge and annotation from previous papers 
reevaluated to ensure that annotation was not the result of over- 
interpretation of data. 

The most widely used manual protein annotation editor
for GO, Protein2GO, has a mechanism to dispute questionable
or outdated annotations that sends a request for reevaluation
of annotations [27]. Users who notice incorrect or missing
annotations are strongly encouraged to notify the GO helpdesk
(http://geneontology.org/form/contact-go) so that corrections
can be made.


3 Importance of Annotation Consistency: Toward a Quality Control Approach 


The goal of the GO project is to provide a uniform schema to 
describe biological processes mediated by gene products in all cel- 
lular organisms [2]. Annotation involves translating conclusions 
from biological experiments into this schema, such that we are 
making inferences of inferences. To avoid deriving too much from 
the biologically relevant conclusions of experiments, consistent 
annotation within the GO framework is essential. 

The GO curators make every effort to ensure that annotations 
reflect the current state of knowledge. As new findings are made 
that invalidate or refine existing models there is a need for course 
correction; otherwise both the ontology and the annotations 
may drift. 

Over 20 groups contribute manual annotations to the GO
project (http://geneontology.org/page/download-annotations). 
The number of annotations by species, broken down into experi- 
mental versus non-experimental, is shown in Fig. 3. Since manual 
annotations are so critical to the overall quality of the entire corpus 
of GO data, it is important that each biocurator from every con- 
tributing group interprets experiments consistently. 


[Plot: “Annotations by Species”; x-axis: species; y-axis: annotations (non-experimental and experimental); series: experimental, non-experimental]

Fig. 3 Number of annotations in 12 species annotated by the GO consortium. Source:
http://geneontology.org/page/current-go-statistics


While the GO Consortium does not possess sufficient resources
to review all annotations individually on an ongoing basis, several
approaches are in place to ensure consistency:

GO uses automated procedures for validating GO annotations.
An automated checker runs through the GO annotation
rulebase (http://geneontology.org/page/annotation-quality-control-checks),
which validates the syntactic and biological
content of the annotation database, and verifies that correct
procedures are followed. Examples include taxon checks [28]
and checks to ensure that the correct object type is used with
different types of evidence.

The annotation team of the GO consortium also has regular
annotation consistency exercises, where participating annotators
independently annotate the same paper to ensure that
guidelines are applied in a uniform manner, discuss any discrepancy,
and update guidelines when these are lacking or need
clarification.

Finally, the Reference Genome Project [29] has proven to be a
very useful resource to improve annotation coherence across
the GO (Feuermann et al., in preparation). The project uses
PAINT, a Phylogenetic Annotation and INference Tool, to
annotate protein families from the PantherDB resource [30].
PAINT integrates phylogenetic trees, multiple sequence alignments,
experimental GO annotations, as well as references
pointing to the original data. PAINT curators select the
high-confidence data that can be propagated across either the entire
tree or specific clades. By displaying different GO annotations
for all members of a family, PAINT makes it easy to detect
inconsistencies, thus improving the overall quality of the set of
GO annotations. It also provides a means of identifying consistent
biases that usually indicate a problem in the ontology or in the
annotation guidelines.


4 Summary


Expert curation of GO terms based on experimental data is a com- 
plex process that requires a number of skills from biocurators. In 
this chapter, we describe a number of guidelines to warn curators 
on common annotation mistakes and provide clues on how to 
avoid them. These simple rules, summarized in Table 2, can be 
used as a checklist to ensure that GO annotations are in line with 
GO consortium guidelines. 
Table 2
Summary of annotation guidelines 


Carefully select publications. 
Only annotate papers that provide the most added value. 


Read recent publications. 
Research is not a straightforward process and reading recent publications helps resolve conflicts
and detect experimental discrepancies.


Check annotation consistency. 
Review the existing annotations for related proteins to see whether the annotations you are adding are 
consistent. 


Look for confirmation for unusual findings with multiple papers, if possible. 
Avoid entering annotations based on experiments that do not directly implicate the protein with the 
GO term you annotate. 


Annotate the conclusion of the experiment. 
Keep in mind that this may be different from the results presented. Be especially careful of interpreting 
the function of proteins based on mutant phenotypes. 


Remove obsolete annotations. 
If you encounter an annotation that is based on an interpretation of an experiment that is no longer 
valid, use the Challenge mechanism or GO helpdesk to ask to have the annotation removed. 


5 Perspective 


The guidelines presented here are easy to follow and reinforce cura- 
tion quality without reducing curation efficiency, which is a serious 
and valid challenge in the era of big data. In view of the amount of 
data to be dealt with, it has often been argued that manual curation 
“just doesn’t scale,” and an ongoing search for alternative methods 
is under way in the world of biocuration and bioinformatics. 
However, examples described in this chapter show that most pub- 
lications describe complex knowledge that cannot be captured by 
machine learning or text mining technologies. To continue having 
an acceptable throughput, manual curation should be able to cope 
with the increasing corpus of scientific data. From this perspective, 
PAINT constitutes an excellent example of a propagation tool 
based on experimental GO annotations, which ensures maximum 
consistency and efficiency without compromising the quality of the 
annotations produced. Such a system provides one possible answer
to the concerns raised about the scalability of expert curation.
Chapter 5


Computational Methods for Annotation Transfers 
from Sequence 


Domenico Cozzetto and David T. Jones 


Abstract 


Surveys of public sequence resources show that experimentally supported functional information is still 
completely missing for a considerable fraction of known proteins and is clearly incomplete for an even 
larger portion. Bioinformatics methods have long made use of very diverse data sources alone or in com- 
bination to predict protein function, with the understanding that different data types help elucidate com- 
plementary biological roles. This chapter focuses on methods accepting amino acid sequences as input and 
producing GO term assignments directly as outputs; the relevant biological and computational concepts 
are presented along with the advantages and limitations of individual approaches. 


Key words Protein function prediction, Homology-based annotation transfers, Phylogenomics, 
Multi-domain architecture, De novo function prediction 


1 Introduction 


For decades experimentalists have been painstakingly probing a 
range of functional aspects of individual proteins. This steady but 
slow acquisition of functional data is in stark contrast to the results 
of next-generation sequencing technologies, which can survey 
gene expression regulation, genomic organization, and variation 
on a large scale [1]. Similarly, parallel efforts aim to map the networks
of interactions between proteins, nucleic acids, and metabolites
that regulate biological processes [2-4]. Nonetheless,
comprehensive studies of protein function are hindered, because 
the combinations of gene products, biological roles, and cellular 
conditions are too numerous and because many experimental pro- 
tocols cannot be applied to all proteins. Furthermore, the results 
need to be critically interpreted, integrated with existing knowl- 
edge, and translated into machine-readable formats—such as Gene 
Ontology (GO) [5] terms—for further analyses. 
Manual curation requires substantial time and effort too;
therefore the exponential growth in the number of sequences in
UniProtKB [6] has only been matched by a linear increase in the
number of entries with experimentally supported GO terms.
Moreover, only 0.03 % of the sequences have received annotations
for all three GO domains, and the level of annotation detail can also
fall far short of the maximum possible. For example, there is direct
evidence that some E. coli K12 proteins act as transferases, with no
additional information about the chemical group relocated from
the donor to the acceptor. Automated protein function prediction
has consequently represented the only viable way to bridge some
of these gaps, and indeed UniProtKB already exploits some computational
tools (Fig. 1).

Given the lack of a general theory which can link protein 
sequences and environmental conditions directly to biological 
functions from physicochemical properties, current methods for 
protein function prediction implement knowledge-based heuristics 
that transfer functional information from already annotated pro- 
teins to unannotated ones. This chapter reviews sequence-based 
approaches to GO term prediction, which are the most popular, 
well understood, and easily accessible to a wide range of users. The
focus is primarily on the underpinning concepts and assumptions,
as well as on the known advantages and pitfalls, which are all
applicable to other controlled vocabularies, such as those described
in Chap. 19 [7] “KEGG, EC and other sources of functional
data”. How well current function prediction methods perform and
how prediction accuracy can be measured are topics extensively
covered in Chap. 8 [8] “Evaluating GO annotations”, Chap. 9
[9] “Evaluating functional annotations in enzymes”, and Chap. 10
[10] “Community Assessment”.

Fig. 1 Function annotation coverage of proteins in UniProtKB. (a) Over the past decade, the number of amino
acid chains deposited in UniProtKB has grown exponentially (black line), while the number with experimentally
supported GO term assignments has only increased linearly (green line). This core subset, however, has allowed
GO terms to be assigned to a substantial fraction of sequences (orange line). (b) Even with electronically inferred
annotations, more than 80 % of sequences in UniProtKB release 2015_01 lack assignments for at least one of
the molecular function, biological process, or cellular component GO sub-ontologies. Plots and statistics are
based on the first release of each year


2 Annotation Transfers from Homologous Proteins 


The most common way to annotate uncharacterized proteins con- 
sists in finding homologues—that is, proteins sharing common 
ancestry—of known function, and inheriting the information avail- 
able for them under the assumption that function is evolutionarily 
conserved. BLAST [11] or PSI-BLAST [12] are routinely used to 
search for homologous sequences, and tools that compare 
sequences against hidden Markov models (HMMs), or pairs of 
profiles or of HMMs can be useful to extend the coverage of the 
protein sequence universe thanks to the increased sensitivity for 
remote homologues. A detailed presentation of sequence compari- 
son methods is beyond the scope of this chapter and is available 
elsewhere [13]. In the simplest case, transfers can be made from 
the sequence with experimentally validated annotations and the 
lowest E-value—and this represents a useful baseline to benchmark 
the effectiveness of more advanced methods. This approach can 
produce erroneous results when key functional residues are 
mutated, or when the alignment doesn’t span the whole length of 
the proteins—possibly indicating changes in domain architecture 
[14]. Iterative transfers of computationally generated functional 
assignments can lead to uncontrolled propagation of such errors; 
the average error rate of molecular function annotations is estimated
to approach 0% only in the manually curated UniProtKB/Swiss-Prot
database, while it is substantially higher in unreviewed
resources [15].
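
The baseline described above (transfer from the hit that has experimentally validated annotations and the lowest E-value) can be sketched in a few lines. This is an illustration under stated assumptions rather than any published tool: the `Hit` record, the `max_evalue` cutoff of 1e-3, and the hit identifiers are hypothetical, and the search results are assumed to have been parsed already. The evidence codes listed are the GO experimental evidence codes.

```python
from dataclasses import dataclass, field

# GO evidence codes that denote experimental support.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


@dataclass
class Hit:
    subject_id: str
    evalue: float
    # (go_id, evidence_code) pairs attached to the hit sequence
    annotations: list = field(default_factory=list)


def baseline_transfer(hits, max_evalue=1e-3):
    """Return the experimental GO terms of the lowest E-value annotated hit.

    Hits above `max_evalue` (an assumed cutoff) and hits without any
    experimentally supported term are ignored.
    """
    best_hit, best_terms = None, set()
    for hit in hits:
        if hit.evalue > max_evalue:
            continue
        terms = {go for go, code in hit.annotations if code in EXPERIMENTAL_CODES}
        if not terms:
            continue
        if best_hit is None or hit.evalue < best_hit.evalue:
            best_hit, best_terms = hit, terms
    return best_terms


# Hypothetical search results for one query sequence.
example_hits = [
    Hit("hypA", 1e-50, [("GO:0016301", "IEA")]),   # closest hit, electronic only
    Hit("hypB", 1e-30, [("GO:0004672", "IDA")]),   # best experimentally annotated hit
    Hit("hypC", 1e-10, [("GO:0016301", "IMP")]),
]
predicted = baseline_transfer(example_hits)
```

Note that the closest hit is skipped because its only annotation is electronically inferred; restricting the source to experimental evidence is what keeps this baseline from compounding earlier computational errors.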

Several studies have consequently attempted to estimate 
sequence similarity thresholds that would generate predictions 
with a guaranteed level of accuracy, and have suggested that 80% 
global sequence identity should be generally sufficient for safe 
annotation transfers [16-20]. However, this rule of thumb can 
either be too stringent or too lax, because biological sequences 
evolve at differing rates due to the need to maintain physiological 
function on the one hand, and to avoid deregulated gene expres- 
sion, protein translation, folding, or physical interactions on the 
other [21]. Ideally, these cutoff values should be specific to 
individual families or even functional categories, but usually the
number of labelled examples is not sufficient to allow reliable cali- 
bration. To circumvent these issues, it is possible to trade annota- 
tion specificity for accuracy, because broad functional aspects—e.g., 
about ligand binding and enzymatic or transporter activities— 
diverge at lower rates than the fine details—such as the specific 
metal ions bound or the molecules and chemical groups that are 
recognized and processed. 

GOtcha [22] was the first tool to make predictions represent- 
ing the enrichment of the GO terms assigned to BLAST hits in the 
hierarchical context of GO. It first calculates weights for each GO 
term, taking into account the number of similar sequences anno- 
tated with it and the statistical significance of the observed similari- 
ties. The program then considers the semantic relationships among 
the terms to update the tallies and reflect increasing confidence in 
more general annotations. PFP [23] follows a similar approach, 
but targets more difficult annotation cases, too, by leveraging 
information from PSI-BLAST hits with unconventionally high 
E-values. Furthermore, the scoring scheme exploits data about the 
co-occurrence of GO term pairs in UniProtKB entries, which 
allows safer annotations to be produced. Other methods fall in this 
category too, and interested readers are referred to the primary 
literature [24-27]. More sophisticated approaches rely on machine
learning [28] rather than statistical analyses, and use experimental 
data to train classifiers that predict GO terms based on an array of 
alignment-derived features—such as sequence similarity scores, 
E-values, the coverage of the sequences, or the scores that GOtcha 
calculates for each GO category [29-31]. 
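
The general idea behind hit-weighted scoring with propagation through the ontology can be illustrated as follows. This is a deliberately simplified sketch in the spirit of the approach described for GOtcha, not its actual scoring scheme: the weight (minus log10 of the E-value), the normalization by the highest score, and the toy `PARENTS` hierarchy (in which intermediate GO terms are omitted) are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Toy is_a hierarchy (child -> parents); intermediate GO terms are omitted.
PARENTS = {
    "GO:0004672": ["GO:0016301"],  # protein kinase activity -> kinase activity
    "GO:0016301": ["GO:0003824"],  # kinase activity -> catalytic activity
    "GO:0003824": ["GO:0003674"],  # catalytic activity -> molecular_function
    "GO:0003674": [],
}


def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def score_terms(hits, parents):
    """hits: iterable of (evalue, set_of_go_ids). Returns term -> score in [0, 1]."""
    raw = defaultdict(float)
    for evalue, terms in hits:
        weight = -math.log10(max(evalue, 1e-300))  # guard against evalue == 0
        if weight <= 0:
            continue
        implied = set(terms)
        for term in terms:
            implied |= ancestors(term, parents)
        for term in implied:  # each hit counts once per implied term
            raw[term] += weight
    top = max(raw.values(), default=0.0)
    return {t: s / top for t, s in raw.items()} if top > 0 else {}


# Two hypothetical hits annotated at different levels of specificity.
scores = score_terms(
    [(1e-40, {"GO:0004672"}), (1e-20, {"GO:0016301"})],
    PARENTS,
)
```

Because every hit also supports the ancestors of its terms, “kinase activity” accumulates evidence from both hits and scores higher than “protein kinase activity”, which reflects the increasing confidence in more general annotations mentioned above.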


3 Annotation Transfers from Orthologous Proteins 


Simple homology-based predictors are quick but error prone 
because they don't try to distinguish functionally equivalent rela- 
tives from those that have functionally diverged. In phylogenetic 
terms, this problem can be cast as classifying orthologues—homo- 
logue pairs evolved after speciation—and paralogues—homologue 
pairs derived from gene duplication. It is widely accepted that 
duplicated genes lack selective pressure to maintain their original 
biological roles, so they can easily undergo nucleotide changes ulti- 
mately leading to functional divergence [32]. The realization that 
genetic diversity arises from gene losses and horizontal transfers, 
too, makes phylogenetic reconstruction even more complex. 
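
To make the orthologue/paralogue distinction concrete, the sketch below labels gene pairs in a rooted binary gene tree using the species-overlap heuristic: an internal node whose two subtrees share at least one species is treated as a duplication, otherwise as a speciation. This heuristic is one common simplification and is not described in this chapter; it ignores gene loss and horizontal transfer, and the gene names are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Leaf:
    gene: str
    species: str


@dataclass(frozen=True)
class Node:
    left: object   # Leaf or Node
    right: object  # Leaf or Node


def leaves(tree):
    """List the leaves below a tree node."""
    if isinstance(tree, Leaf):
        return [tree]
    return leaves(tree.left) + leaves(tree.right)


def classify_pairs(tree, out=None):
    """Map each gene pair to 'orthologue' or 'paralogue'.

    A pair is labelled by the event at its last common ancestor:
    duplication if the two child subtrees share a species, else speciation.
    """
    if out is None:
        out = {}
    if isinstance(tree, Leaf):
        return out
    left, right = leaves(tree.left), leaves(tree.right)
    shared = {l.species for l in left} & {l.species for l in right}
    relation = "paralogue" if shared else "orthologue"
    for a, b in product(left, right):
        out[frozenset((a.gene, b.gene))] = relation
    classify_pairs(tree.left, out)
    classify_pairs(tree.right, out)
    return out


# A duplication that predates the human/mouse split, giving copies A and B.
tree = Node(
    Node(Leaf("humanA", "human"), Leaf("mouseA", "mouse")),
    Node(Leaf("humanB", "human"), Leaf("mouseB", "mouse")),
)
relations = classify_pairs(tree)
```

In this example humanA and mouseA are orthologues, whereas humanA and mouseB are paralogues even though they come from different species, which is why the best cross-species hit is not automatically an orthologue.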

In this setup, annotations can be transferred with varying levels 
of confidence depending on how many orthologues there are and 
how closely related they are. This can partly account for the 
observation that orthologues can diverge functionally, particularly
over long evolutionary distances or after duplication events in at 
least one of the lineages [33]. However, experimental studies have 
also shown that paralogues can retain functional equivalence, even 
long after the duplication event [34, 35]. Recent studies have con- 
sequently tested how useful the distinction between orthologues 
and paralogues is for protein function prediction and have drawn 
different conclusions [36-39]. The latest findings suggest that the 
functional similarity between orthologues is slightly higher than 
that between paralogues at the same level of sequence divergence, 
and that the signal is stronger for cellular components than for 
biological processes or molecular functions [38]. 

The traditional approach to orthologue detection involves 
computationally intensive calculations to build phylogenetic trees 
and then identify gene duplication and loss events [40]. SIFTER 
[41] builds on this framework to transfer the most specific experi- 
mentally supported molecular function terms available from the 
annotated sequences to all nodes in the tree using a Bayesian 
approach. The propagation algorithm captures the notion that 
functional transitions are more likely to occur after duplication 
than after speciation events, and when the terms are similar—i.e., 
the corresponding nodes are close in the GO graph. In order to 
speed up the computation, the authors have recently suggested 
limiting the number of GO term annotations that can be assigned 
to each protein [42], and they are providing pre-calculated predictions
for a vast set of sequences from different species, including
multi-domain proteins [43]. The semiautomated Phylogenetic
Annotation and Inference Tool (PAINT) [44] recently adopted 
by the GO consortium provides a more flexible framework, which 
tries to keep functional change events uncoupled, so that the gain 
of one function does not imply the loss of another and vice versa— 
a desirable feature for annotating biological processes and for 
dealing with multifunctional proteins in general. Furthermore, 
unlike SIFTER, PAINT makes no assumption about how func- 
tion diverges over evolutionary distance and whether its conserva- 
tion is higher within orthologous groups than between them. 

The increasing availability of completely sequenced genomes 
has promoted the development of alternative algorithms for ortho- 
logue detection. These first categorize pairs of orthologues in any 
two species, and then cluster the results across organisms, which 
helps recognize and fix spurious assignments [40]. The results are 
usually made publicly available in the form of specialized databases 
such as EggNOG [45], Ensembl Compara [46], Inparanoid [47], 
PANTHER [48], PhylomeDB [49], and OMA [50], and the clus- 
tering results provide the basis for GO term annotation transfers, 
under the assumption that the members of an orthologous group 
are functionally equivalent. 
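
The transfer step described above can be sketched as follows. This is only a minimal illustration under the stated assumption of functional equivalence within a group; the group names, protein identifiers, and the function `transfer_within_groups` are hypothetical and are not taken from any of the databases listed.

```python
from collections import defaultdict

# Hypothetical orthologous groups: group id -> member protein ids.
groups = {
    "OG1": ["human_A", "mouse_A", "yeast_A"],
    "OG2": ["human_B", "fly_B"],
}

# Hypothetical experimentally supported annotations: protein -> GO terms.
experimental = {
    "human_A": {"GO:0005507"},
    "yeast_A": {"GO:0003677"},
}

def transfer_within_groups(groups, experimental):
    """Give each unannotated member the union of the GO terms carried by
    the annotated members of its group (assumes functional equivalence)."""
    predicted = defaultdict(set)
    for members in groups.values():
        pooled = set()
        for protein in members:
            pooled |= experimental.get(protein, set())
        for protein in members:
            if protein not in experimental and pooled:
                predicted[protein] |= pooled
    return dict(predicted)

predictions = transfer_within_groups(groups, experimental)
```

Groups without any annotated member (OG2 in this toy input) yield no prediction, which reflects the dependence of such transfers on existing experimental annotations.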


4 Annotation Transfers from Protein Families 


Even when the sequence similarities between proteins of interest 
and those that have previously been characterized are limited to 
specific sites, such as individual domains or motifs, they can still be 
useful for function prediction. Some biological activities such as 
molecular recognition, protein targeting, and pathway regulation 
have long been mechanistically linked to short linear motifs— 
stretches of 10-20 consecutive amino acids exposed on protein 
surfaces [51]. Furthermore, some well-known protein families can 
be described by specific arrangements of multiple, possibly discon- 
tinuous, linear motifs, or by more general models of their domain 
sequences, namely sequence profiles [52] or hidden Markov mod- 
els [53]. Many public databases now give access to groups of evo- 
lutionarily related proteins, coding for individual domains or 
multi-domain architectures. Even though these resources cannot 
directly assign GO terms to the input amino acid sequences, they 
can produce valuable assignments to known protein families. 

InterPro [54] collates such results from 11 specialized and 
complementary resources, which differ by the types of patterns 
used for family assignment, by the amount of manual curation of 
their contents, and by the use of additional data such as 3-D struc- 
ture or phylogenetic trees. InterPro entries combine available data 
and organize them in a hierarchical way, which mirrors the biologi- 
cal relationships between families and subfamilies of proteins. The 
curators also enrich these annotations with supporting biological 
information from the scientific literature and with links to external 
resources such as the PDB [55] and GO. InterPro provides func- 
tion predictions for the input sequences based on the InterPro2GO 
mapping, which links each protein domain family to the most spe- 
cific GO terms that apply to all its members [56]. These annota- 
tions form a large bulk of the electronically inferred functional 
assignments in UniProtKB, where they are integrated with associa- 
tions generated from other controlled vocabularies, e.g., about 
subcellular localization and enzymatic activity. 
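
The idea of keeping only "the most specific GO terms that apply to all members" can be illustrated with a small sketch. It is not the InterPro2GO curation procedure, which involves manual curation; the graph fragment below is a simplified, hypothetical excerpt, and the function names are ours.

```python
# Hypothetical, simplified fragment of the GO graph: term -> direct parents.
PARENTS = {
    "copper ion binding": {"transition metal ion binding"},
    "zinc ion binding": {"transition metal ion binding"},
    "transition metal ion binding": {"metal ion binding"},
    "metal ion binding": {"binding"},
    "binding": set(),
}

def with_ancestors(term):
    """Return the term together with all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def most_specific_shared_terms(member_annotations):
    """Terms implied by every member, minus those that are ancestors
    of another shared term."""
    closures = []
    for terms in member_annotations:
        closure = set()
        for term in terms:
            closure |= with_ancestors(term)
        closures.append(closure)
    shared = set.intersection(*closures)
    return {
        term for term in shared
        if not any(term != other and term in with_ancestors(other)
                   for other in shared)
    }

# Hypothetical family whose members bind different transition metals.
family = [{"copper ion binding"}, {"zinc ion binding"}, {"copper ion binding"}]
shared_terms = most_specific_shared_terms(family)
```

In this toy family the members disagree on the metal, so only the common parent term survives; the more heterogeneous a family, the more generic the terms that can safely be attached to it.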

CATH-Gene3D [57] and SUPERFAMILY [58] are two data- 
bases that store domain assignments for known protein sequences 
based on the CATH [59] and SCOP [60] protein structure clas- 
sification schemes, respectively. CATH-Gene3D data are clustered 
into functional families which include relatives with highly similar 
sequences, structures, and functions, so as to highlight the strong 
conservation of important regions such as specificity-determining 
residues. GO terms are associated probabilistically to each func- 
tional family based on how often they occur in the UniProtKB 
annotations of the whole sequences. The recent CATH 
FunFHMMer web server automates the search procedure for input 
sequences, resolves multi-domain architectures, assigns each pre- 
dicted domain to its functional family, and finally inherits the GO 
term annotations found in the library [61]. The dcGO (short for 
domain-centric GO) method follows a similar route, but with some 
key differences [62]. HMM models are built for both individual 
domains and supra-domains, i.e., sets of consecutive domains that 
are defined according to the SCOP structural definition and the 
evolutionary one in Pfam [63]. Given the annotations in the GOA 
database [64] and the GO hierarchical structure, each domain and 
supra-domain is labelled with a set of GO terms that are associated 
with it in a statistically significant way. The strength of each asso- 
ciation is then empirically converted into a confidence score. To 
facilitate the analysis of the results by non-specialists, the predicted 
GO terms are divided into four classes according to how specific 
and informative they are using their information content. 
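
Two quantities mentioned above, the frequency of a GO term within a family and the information content of a term, can be sketched as follows. This is a simplified illustration, not the scoring used by FunFHMMer or dcGO; the annotation sets are hypothetical, and the information content is computed as the negative base-2 logarithm of the fraction of background proteins carrying the term, which is one common definition.

```python
import math
from collections import Counter

# Hypothetical annotations of the sequences in one functional family.
family_members = [
    {"GO:0005507", "GO:0003677"},
    {"GO:0005507"},
    {"GO:0005507", "GO:0008270"},
    {"GO:0003677"},
]

def term_frequencies(members):
    """Fraction of family members annotated with each GO term."""
    counts = Counter(term for terms in members for term in terms)
    return {term: n / len(members) for term, n in counts.items()}

def information_content(term, background):
    """-log2 of the fraction of background proteins annotated with the
    term; assumes the term occurs at least once in the background."""
    p = sum(term in terms for terms in background) / len(background)
    return -math.log2(p)

frequencies = term_frequencies(family_members)
# For brevity the family itself is reused as the background set.
rare_term_ic = information_content("GO:0008270", family_members)
```

Rare terms receive a high information content and frequent ones a low value, which is why such a measure can be used to sort predictions into classes of increasing specificity.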


5 De Novo Function Annotation Using Biological Features 


The function annotation methods described so far make use of 
homology to transfer GO terms to a target protein from other 
previously characterized proteins. In some cases, however, no use- 
ful functional annotations can be found for any of the detectable 
homologues, or in the most extreme case no homologous sequences 
can be found at all. In this case a de novo method is required which 
can infer GO terms directly from amino acid sequence in the 
absence of evolutionary relatedness. This is a very hard problem, 
and only a few tools have been developed which can handle these 
situations. The most successful approaches to date employ the 
basic idea of first transforming the target sequence into a set of 
component features. These features are then related to particular 
broad functional classes by means of supervised machine learning 
techniques. In this way the methods address the question of what 
kinds of functions proteins can perform with the given set of protein 
features. As a trivial example, proteins which are predicted to 
have particular numbers of transmembrane helices as component 
features will be more likely to have transmembrane transporter 
activity. 
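
The two steps, turning a sequence into features and relating features to broad functional classes, can be sketched with a deliberately simple classifier. The feature values and class labels below are hypothetical, and a nearest-centroid rule is used only as a stand-in for the neural networks and support vector machines employed by the published methods.

```python
import math

# Hypothetical feature vectors: (predicted transmembrane helices,
# signal peptide present, fraction of disordered residues).
training = [
    ((12, 0, 0.05), "transporter"),
    ((10, 0, 0.10), "transporter"),
    ((0, 0, 0.60), "transcription factor"),
    ((0, 0, 0.45), "transcription factor"),
]

def centroids(examples):
    """Average the feature vectors of each functional class."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def predict(features, model):
    """Assign the class whose centroid is closest in feature space."""
    return min(model, key=lambda label: math.dist(features, model[label]))

model = centroids(training)
predicted_class = predict((11, 0, 0.08), model)
```

In a real setting the features would be rescaled so that no single attribute dominates the distance, and one model would typically be trained per functional class.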

ProtFun, which makes use of neural networks, was the first 
widely used method for transferring functional annotations between 
human proteins through similarity of biochemical attributes, such 
as the occurrence of charged amino acids, low-complexity regions, 
signal peptides, transmembrane helices, and posttranslationally 
modified residues [65, 66]. In the original ProtFun method, only 
the broad functional classes originally compiled by Monica Riley 
[67] were considered, but later the authors extended their approach 
to predicting a representative set of GO terms. FFPred, which is 
based on support vector machines, has taken this approach further 
by considering the observed strong correlation between the lengths 
and positions of intrinsically disordered protein regions and certain 
molecular functions and biological processes [68, 69]. As with 
ProtFun, FFPred was initially developed specifically for annotating 
human proteins, but the results have been shown to extend reason- 
ably well to other vertebrate proteomes too. 

Feature-based protein function assignment offers both advan- 
tages and disadvantages over sequence similarity-based approaches. 
The main advantage is fairly obvious: feature-based methods can 
work in the absence of homology to characterized proteins, and 
thus can even be used to assign GO terms to orphan proteins. A 
further advantage is that feature-based prediction is also able to 
provide insight into functional changes that occur after alternative 
splicing, as the input features are likely to reflect sequence dele- 
tions relative to the main transcript, e.g., the loss of a signal pep- 
tide or disordered region. Probably the main disadvantage is that 
classification models can only be built for GO terms where there 
are sufficient examples with experimentally validated assignments. 
This generally means that assignments can only be made for terms 
fairly high up in the overall GO graph, and thus highly specific 
predictions are generally not possible using this kind of approach. 
Of course, as datasets become larger, these methods will be able to 
overcome this limitation. 


6 Conclusions and Outlook 


The widening gap between the number of known sequences and 
those experimentally characterized has stimulated the development 
and refinement of a wide array of computational methods for pro- 
tein function prediction. The scope of this survey has been limited 
to four classes of sequence-based approaches for GO term annota- 
tion transfers, but several other routes could be followed. If the 
3-D structure of a protein has been solved or accurately modelled, 
it is possible to search for global or local structural similarities and 
predict binding regions and catalytic sites [70, 71]. Comparison of 
multiple complete genomes can help detect not only orthologous 
genes as described above, but also further patterns indicative of 
functional linkages between gene pairs such as fusion events, con- 
served chromosomal proximity, and co-occurrence/absence in a 
group of species [72]. Phylogenetic profiling posits that coevolv- 
ing protein families are functionally coupled, e.g., because they 
encode for proteins assembling into obligate complexes or partici- 
pating in the same biological process. Since its inception, this 
“guilt-by-association” method has been implemented in several 
different ways [73], and tools able to make GO term assignments 
are also emerging [74]. Involvement in the same biological process 
or co-localization can also be inferred from the analysis of protein- 
protein interaction maps, gene expression profiles, and phenotypic 
variations following engineered genetic mutations [75]. Finally, 
integrative strategies combine all such heterogeneous data sources 


Chapter 6 


Text Mining to Support Gene Ontology Curation 
and Vice Versa 


Patrick Ruch 


Abstract 


In this chapter, we explain how text mining can support the curation of molecular biology databases 
dealing with protein functions. We also show how curated data can play a disruptive role in the development 
of text mining methods. We review a decade of efforts to improve the automatic assignment of Gene 
Ontology (GO) descriptors, the reference ontology for the characterization of genes and gene products. 
To illustrate the high potential of this approach, we compare the performances of an automatic text 
categorizer and show a large improvement of +225% in both precision and recall on benchmarked data. We 
argue that automatic text categorization functions can ultimately be embedded into a Question-Answering 
(QA) system to answer questions related to protein functions. Because GO descriptors can be relatively 
long and specific, traditional QA systems cannot answer such questions. A new type of QA system, 
so-called Deep QA, which uses machine learning methods trained with curated contents, is thus emerging. 
Finally, future advances of text mining instruments are directly dependent on the availability of 
high-quality annotated contents at every curation step. Database workflows must start recording explicitly all 
the data they curate and ideally also some of the data they do not curate. 


Key words Automatic text categorization, Gene ontology, Data curation, Databases, Data steward- 
ship, Information storage and retrieval 


1 Introduction 


This chapter attempts to concisely describe the role played by text 
mining in literature-based curation tasks concerned with the descrip- 
tion of protein functions. More specifically, the chapter explores the 
relationships between the Gene Ontology (GO) and Text Mining. 
Subheading 2 introduces the reader to basic concepts of text 
mining applied to biology. For a more general introduction, the 
reader may refer to a recent review paper by Zheng et al. [1]. 
Subheading 3 presents the text mining methods developed to 
support the assignment of GO descriptors to a gene or a gene prod- 
uct based on the content of some published articles. The section 
also introduces the methodological framework needed to assess the 
performance of these systems, called automatic text categorizers. 


Christophe Dessimoz and Nives Skunca (eds.), The Gene Ontology Handbook, Methods in Molecular Biology, vol. 1446, 

Subheading 4 presents the evolution of results obtained today 
by GOCat, a GO categorizer, which participated in several 
BioCreative campaigns. 

Finally, Subheading 5 discusses an inverted perspective and 
shows how GO categorization systems are foundational to a new 
type of text mining application, so-called Deep 
Question-Answering (QA). Given a question, Deep QA engines are able to 
find answers that are literally found in no corpus. 

Subheading 6 concludes and emphasizes the responsibility of 
national and international research infrastructures in establishing 
virtuous relationships between text mining services and curated 
databases. 


2 State of the Art 


This section presents the state of the art in text mining from the 
point of view of a biocurator, i.e., a person who is maintaining the 
knowledge stored in gene and protein databases. 


2.1 Curation Tasks 


In modern molecular biology databases, such as UniProt [2], the 
content is authored by biologists called biocurators. The work per- 
formed by these biologists when they curate a gene or a gene prod- 
uct encompasses a relatively complex set of individual and 
collaborative tasks [3]. We can separate these tasks into two sub- 
sets: sequence annotation—any information added to the sequence 
such as the existence of isoforms—and functional annotation—any 
information about the role of the gene or gene product in a given 
pathway or phenotype. Such a separation is partially artificial 
because a functional annotation can also establish a relationship 
between the role of a protein and some sequence positions, but it is 
didactically convenient to adopt such a view. 

The primary source of knowledge for genomics and proteomics 
is the research literature. In the context of biocuration, text mining 
can be defined as a process aimed at supporting biocurators when 
they search, read, identify entities, and store the resulting 
structured knowledge. The development of benchmarks and metrics to 
evaluate how automatic text mining systems can help perform 
these tasks is thus crucial. 

BioCreative is a community initiative that periodically evaluates 
the advances in text mining for biology and biocuration. The 
forum explored a wide span of tasks with emphasis on named-entity 
recognition. Named-entity recognition covers a large set of meth- 
ods that seek to locate and classify textual elements into predefined 
categories such as the names of persons, organizations, locations, 
genes, diseases, chemical compounds, etc. Thus, querying PubMed 
with the keywords “biocreative” and “information retrieval” returns 
8 PMIDs, whereas 32 PMIDs are returned for the keywords 
“biocreative” and “named entity” [18th of November 2015]. 


Table 1 
Comparative curation steps supported by text mining 

#   [4]                                   [5] 
1   Retrieval                             Collection 
2   Selection                             Triage 
3   Reading/Passage retrieval 
4   Entity extraction                     Entity indexing 
5   Entity normalization 
6   Relationship + evidence annotation 
7                                         Extraction of evidences, e.g., images 
8   Feed-back 
9                                         Check of records 

Reference [4] describes the curation task as an iterative process (#8 Feed-back) whereas 
[6] describes it as a linear process (ending with #9 Check of records). Both descriptions 
are however consistent. Thus, it is possible to align steps #1, #2, and #4 in Table 1. Step 
#8 is optional in [4] as the process is regarded as an iterative process. This step is an 
“intelligent” follow-up of the curation task, where already annotated functions/properties 
should receive less priority in the next Retrieval step. In contrast, steps #3 “Reading/ 
passage retrieval” and #8 “Feed-back” are missed by [6], while the “Extraction of 
evidences” and “Check of records” steps are missed by [4]. Step #5, i.e., the assignment 
of unique identifiers to descriptors, in [4] is implicit in step #4 of [6]. 

The general workflow of a curation process supported by text 
mining instruments commonly comprises 6-9 steps as displayed in 
Table 1, which is a synthesis inspired by both [6] and [4]. 

Search is often the first step of a text mining pipeline, although 
information retrieval has received little attention from bioinforma- 
ticians active in Text Mining. Fortunately, information retrieval has 
been explored by other scientific communities and in particular by 
information scientists via the TREC (Text Retrieval Conferences) 
evaluation campaigns; see ref. 7 for a general introduction. From 
2002 to 2015, molecular biology [8], clinical decision-support [9] 
and chemistry-related information retrieval [10] challenges have 
been explored by TREC. Interestingly, large-scale information 
retrieval studies have consistently shown that named-entity recog- 
nition has no or little impact on search effectiveness [11, 12]. 


Beyond information retrieval, more elaborate mining instruments 
can then be derived. Thus, search engines, which return docu- 
ments or pointers to documents, are often powered with passage 
retrieval skills [7], i.e., the ability to highlight a particular sentence, 
a few phrases, or even a few keywords in a given context. 
The enriched representation can help the end-user to decide upon 
the relevance of the document. For MEDLINE records, such 
passage retrieval functionalities are not crucial because an abstract 
is short enough to be rapidly read by a human, but passage retrieval 
tools become necessary when the search is performed on a collection 
of full-text articles, as for instance in PubMed Central. Within 
a full-text article, the ability to identify the section where a given 
set of keywords occurs can be very useful, as matching the relevant 
keywords in a “background” section has a different value than 
matching them in a “results” section. The latter is likely to be a new 
statement while the former is likely to be regarded as 
well-established knowledge. 


2.3 Named-Entity Recognition 


Unlike in other scientific or technical fields (finance, high energy 
physics, etc.), in the biomedical domain, named-entity recognition 
covers a very large set of entities. Such a richness is well expressed 
by the content of modern biological databases. Text Mining stud- 
ies have been published for many of those curation needs, includ- 
ing sequence curation and identification of polymorphisms [13], 
posttranslational modifications [14], interactions with gene prod- 
ucts or metabolites [15], etc. In this context, most studies 
attempted to develop instruments likely to address a particular set 
of annotation dimensions, serving the needs of a particular 
molecular biology database. The focus in such studies is often to design 
a graphical user interface and to simplify the curation work by 
highlighting specific concepts in a dedicated tool [16]. While most 
of these systems seem to be exploratory studies, some seem deeply 
integrated in the curation workflow, as shown by the OntoMate tool 
designed by the Rat Genome Database [17], the STRING DB for 
protein-protein interactions or the BioEditor of neXtProt [18]. 

From an evaluation perspective, the idea is to detect the begin- 
ning and the end of an entity and to assign a semantic type to this 
string. Thus in named-entity recognition, we assume that entity 
components are textually contiguous. Inherited from early corpus 
works on information extraction and computational linguistics 
[19], the goal is to assign a unique semantic category—e.g., Time, 
Location, and Person—to a string in a text [20]. 
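
This evaluation principle can be sketched as follows, assuming entities are represented as (start, end, type) triples over character offsets and that only exact matches count. The sentence, the annotations, and the function name are hypothetical examples, not data from any benchmark.

```python
# Each entity is (start offset, end offset, semantic type); offsets are
# character positions in the text, with the end position exclusive.
text = "Copper modulates p53 binding in human cells."
gold = {(0, 6, "Chemical"), (17, 20, "Gene"), (32, 37, "Species")}
# Hypothetical system output: the gene mention is given a wrong end.
predicted = {(0, 6, "Chemical"), (17, 28, "Gene"), (32, 37, "Species")}

def exact_match_scores(gold, predicted):
    """Precision and recall when boundaries and type must all match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

precision, recall = exact_match_scores(gold, predicted)
```

Because the predicted span "p53 binding" overshoots the gold span "p53", it counts both as a false positive and as a missed entity, which shows how strongly boundary decisions weigh on this kind of score.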

Semantic categories are virtually infinite but some entities 
have received more attention. Genes, gene products, proteins, species 
[21, 22], and more recently chemical compounds have been 
significantly more studied than, for instance, organs, tissues, cell types, cell 
anatomy, molecular functions, symptoms, or phenotypes [23]. 

The initial works dealing with the recognition of GO entities 
were disappointing (Subheading 3.2), which may explain part of 
the reluctance to address these challenges. We see here one 
important limitation of named entities: it is easy to detect one- or 
two-word terms in a document, while the recognition of a protein 
function requires a “deeper” understanding or combination of 
biological concepts. Indeed a complex GO concept is likely to 
combine subconcepts belonging to various semantic types, includ- 
ing small molecules, atoms, protein families, as well as biological 
processes, molecular functions, and cell locations. 


In order to compensate for the limitations of named-entity recog- 
nition frameworks, two more complementary approaches have 
been proposed: entity normalization and information (or relation- 
ship) extraction. 

Normalization can be defined as the process by which a unique 
semantic identifier is assigned to the recognized entities [24]. The 
identifiers are available in different resources such as several onto- 
terminologies or knowledge bases. The assignment of unique iden- 
tifiers can be relatively difficult in practice due to a linguistic 
phenomenon called lexical ambiguity. Many strings are lexically 
ambiguous and therefore can receive more than one identifier 
depending on the context (e.g., HIV could be a disease or a virus). 
The difficulty is amplified in cascaded lexical ambiguities. Many 
entities require the extraction of other entities to receive an unam- 
biguous identifier. For instance, the assignment of an accession 
number to a protein may depend on the recognition of an organ- 
ism or a cell line somewhere else in the text. 
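
The dependence of normalization on other recognized entities can be illustrated with a toy lexicon. The accession strings below are invented placeholders, not real database identifiers, and the organism detection is reduced to a simple keyword test.

```python
# Hypothetical lexicon: mention -> {organism cue: placeholder accession}.
LEXICON = {
    "p53": {"human": "ACC:p53_human", "mouse": "ACC:p53_mouse"},
}

def normalize(mention, text):
    """Return a unique identifier for the mention, or None when the
    context does not single out exactly one candidate."""
    candidates = LEXICON.get(mention.lower(), {})
    lowered = text.lower()
    hits = [accession for organism, accession in candidates.items()
            if organism in lowered]
    return hits[0] if len(hits) == 1 else None

resolved = normalize("p53", "Copper modulates p53 in human cells.")
unresolved = normalize("p53", "Copper modulates p53 conformation.")
```

Returning no identifier when zero or several organisms are detected is a design choice of this sketch; an interactive tool could instead show all candidates to the curator.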

Further, the extraction of relationships requires the recognition 
of the specific entities, which can be as various as a location, an 
interaction (binding, coexpression, etc.) [25], an etiology or a tem- 
poral marker (cause, trigger, simultaneity, etc.) [26]. For some 
information extraction tasks such as protein-protein interactions, 
the normalization and relationship extraction may require first the 
proper identification of other entities such as the experimental 
methods (e.g., yeast 2-hybrid) used to generate the prediction. 
Furthermore, additional information items may be provided such as 
the scale of the interaction or the confidence in the interaction [27]. 

To identify GO terms, named-entity recognition and information 
extraction are insufficient due to two main difficulties: first, the 
difficulty of defining all (or most) strings describing a given con- 
cept; second, the difficulty of defining the string boundaries of a 
given concept. The parsing of texts to identify GO functions and 
how they are linked with a given protein demands the develop- 
ment of specific methods. 


Automatic text categorization (ATC) can be defined as the assign- 
ment of any class or category to any text content. The interested 
reader can refer to [28], where the author provides a comprehensive 
introduction to ATC, with a focus on machine learning methods. 
In both ATC and in Information Retrieval, documents are 
regarded as “bag-of-words.” Such a representation is an 
approximation but it is a powerful and productive simplification. From this 
bag, where all entities and relationships are treated as flat and 
independent data, ATC attempts to assign a set of unambiguous 
descriptors. The set of descriptors can be binary as in triage tasks, 
where documents can be either classified as relevant for curation or 
irrelevant, or it can be multiclass. The scale of the problem is one 
parameter of the model. In some situations, ATC systems do not 
need to provide a clear split between relevant and irrelevant 
categories. In particular, when a human is in the loop to control the 
final descriptor assignment step, ATC systems can provide a ranked 
list of descriptors, where each rank expresses the confidence score 
of the ATC system. ATC systems and search engines share here a 
second common point: compared to named-entity recognition, 
which is normally not interactive, ATC and Information Retrieval 
are well suited for human-computer interactions. 


3 Methods 


With over 40,000 terms (and many more if we account for 
synonyms), assigning a GO descriptor to a protein based on some 
published document is formally known as a large multiclass 
classification problem. 


3.1 Automatic Text Categorization 


The two basic approaches to solve the GO assignment problem are 
the following: (1) exploit the lexical similarity between a text and a 
GO term and its synonyms [29]; (2) use some existing database to 
train a classifier likely to infer associations beyond string matching. 
The second approach uses any scalable machine learning 
technique to generate a model trained on the Gene Ontology 
Annotation (GOA) database. Several machine learning strategies 
have been used but the trade-off between effectiveness, efficiency, 
and scalability often converges toward an approach called k-Nearest 
Neighbors (k-NN); see also ref. 30. 


3.2 Lexical Approaches 


Lexical approaches for ATC exploit the similarities between the 
content of a text and the content of a GO term and its related 
synonyms [31]. Additional information can be taken into account to 
augment the categorization power such as the definitions of the 
GO terms. The ranking functions take into account the frequency 
of words, their specificity (measured by the “inverse document fre- 
quency,” the inverse of how many documents contain the word), 
as well as various positional information (e.g., word order); see ref. 
32 for a detailed description. 
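
A minimal sketch of such a lexical ranking is given below. It is not the ranking function of any published system: positional information is ignored, the GO vocabulary is a three-term excerpt, and the weighting scheme (word frequency times inverse document frequency, scaled by the fraction of the term's words found in the text) is an assumption made for illustration.

```python
import math
from collections import Counter

# Small excerpt of GO terms with their names and synonyms.
GO_TERMS = {
    "GO:0005507": ["copper ion binding", "copper binding"],
    "GO:0003677": ["DNA binding"],
    "GO:0005488": ["binding"],
}

def tokenize(text):
    return text.lower().replace("-", " ").split()

def idf_weights(go_terms):
    """Inverse document frequency, where the names of one GO term
    together form one 'document'."""
    docs = [{w for name in names for w in tokenize(name)}
            for names in go_terms.values()]
    vocabulary = set().union(*docs)
    return {w: math.log(1 + len(docs) / sum(w in d for d in docs))
            for w in vocabulary}

def lexical_rank(text, go_terms):
    """Rank GO terms by the weighted overlap between the text and the
    best-matching name or synonym of each term."""
    idf = idf_weights(go_terms)
    tf = Counter(tokenize(text))
    scores = {}
    for go_id, names in go_terms.items():
        best = 0.0
        for name in names:
            words = set(tokenize(name))
            matched = [w for w in words if tf[w] > 0]
            weight = sum(tf[w] * idf[w] for w in matched)
            best = max(best, weight * len(matched) / len(words))
        scores[go_id] = best
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

title = ("Modulation by copper of p53 conformation and "
         "sequence-specific DNA binding")
ranking = lexical_rank(title, GO_TERMS)
```

With this toy vocabulary the generic term "binding" is ranked last because its only word is shared by every term and therefore carries the lowest specificity.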

The task is extremely challenging if we consider that some GO 
terms contain a dozen words, which makes those terms virtually 
unmatchable in any textual repository. The results of the first 
BioCreative competition, which was addressing this challenge, 
were therefore disappointing. The best “high-precision” system 
achieved an 80% precision but this system covered less than 20% of 
the test sample. In contrast, with a recall close to 80%, the best 
“high-recall” systems were able to obtain an average precision of 
20-30% [33]. At that time, over 10 years ago, such a complex task 
was consequently regarded as practically out of reach for machines. 


The principle of a k-NN is the following: for an instance X to be 
classified, the system computes a similarity measure between X and 
some annotated instances. In a GO categorizer, an instance is 
typically a PMID annotated with some GO descriptors. The annotated 
instances are then ranked by decreasing similarity, and those at the 
top of the list are assumed “similar” to X. The value of k, i.e., the 
number of similar instances (or neighbors) that should be taken 
into account to assign one or several categories to X, must be 
determined experimentally. 

When considering a full-text article, a particular section in this 
article, or even a MEDLINE record, it is possible to compute a 
distance between this section and similar articles in the GOA data- 
base because in the curated section of GOA, many GO descriptors 
are associated with a PMID—those marked up with an EXP evi- 
dence code [34]. The computation of the distance between two 
arbitrary texts can be more or less complex (starting with 
counting how many words they share) and the determination of the k 
parameter can also depend on different empirical features 
(number of documents in the collection, average size of a document, 
etc.), but the approach is both effective and computationally simple 
[7]. Moreover, the ability to index a priori all the curated instances 
makes it possible to compute distances efficiently. 
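
The k-NN principle can be sketched with the simplest similarity mentioned above, the number of shared words. The curated instances are hypothetical, and both the similarity-weighted voting and the normalization of scores to a maximum of 1 are assumptions made for this illustration rather than a description of an existing categorizer.

```python
from collections import Counter

# Hypothetical curated instances: (PMID, abstract text, GO descriptors).
CURATED = [
    ("PMID:1", "copper binding alters p53 conformation", {"GO:0005507"}),
    ("PMID:2", "copper transport across the membrane",
     {"GO:0005507", "GO:0006825"}),
    ("PMID:3", "zinc finger binds a promoter sequence",
     {"GO:0003677", "GO:0008270"}),
]

def similarity(text_a, text_b):
    """Number of distinct words shared by the two texts."""
    return len(set(text_a.lower().split()) & set(text_b.lower().split()))

def knn_categorize(text, curated, k=2):
    """Rank GO descriptors by similarity-weighted votes of the k
    curated instances most similar to the input text."""
    neighbors = sorted(curated,
                       key=lambda inst: similarity(text, inst[1]),
                       reverse=True)[:k]
    votes = Counter()
    for _, abstract, descriptors in neighbors:
        weight = similarity(text, abstract)
        for go_id in descriptors:
            votes[go_id] += weight
    top = max(votes.values(), default=0) or 1
    return [(go_id, score / top)
            for go_id, score in votes.most_common() if score > 0]

query = "modulation by copper of p53 conformation and dna binding"
ranked_descriptors = knn_categorize(query, CURATED, k=2)
```

Note that the descriptor ranked first here never has to appear literally in the query: it is inherited from the annotations of the most similar curated abstracts, which is what allows such a model to go beyond string matching.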

The effectiveness of such machine learning algorithms is 
directly dependent on the volume of curated data. Surprisingly, GO 
categorizers do not seem to be affected by concept drift, which affects 
database and data-driven approaches in general. Even old data, i.e., 
proteins annotated with an early version of the GO, seem useful for 
k-NN approaches [35]. To give a concrete example, consider 
proteins curated in 2005 with the version of the Gene Ontology and the 
MEDLINE records available at that time: it is difficult to 
understand why a model containing mainly annotations from 2010 to 
2014 would outperform a model containing data from 2003 to 
2007, which are exactly centered on 2005. While the GO itself has 
been expanded by at least a factor of 4 in the past decade, the 
consistency of the curation model has remained remarkably stable. 


In Fig. 1, we show an example output of GOCat [35], which is 
maintained by my group at the SIB Swiss Institute of Bioinformatics. 
The same abstract is processed by GOCat using two different types 
of classification methods: a lexical approach and a k-NN. 

In this example, the title of an article ([36]; “Modulation by 
copper of p53 conformation and sequence-specific DNA binding: 
role for Cu(II)/Cu(I) redox mechanism”) is used as input to 
contrast the behavior of the two approaches. This reference is used in 
UniProt to support the assignment of the “copper ion binding” 
descriptor to p53. We see that the lexical system (left panel) is able 
to assign the descriptor at rank #12, while the k-NN system (right 
panel) provides the descriptor in position #1. 


Lexical approach (left panel) 

#   Score  GO ID        Name 
1   1.00   GO:0003677   DNA binding 
2   0.77   GO:0043565   sequence-specific DNA binding (synonym sequence specific dna binding) 
3   0.31   GO:0070712   RNA cytidine-uridine insertion (synonym rna cu insertion) 
4   0.22   GO:0071103   DNA conformation change (synonym dna conformation modification) 
5   0.22   GO:0005488   binding 
6   0.21   [illegible]  copper-nicotianamine [illegible] chelate transporter activity) 
7   0.21   GO:0004008   copper-exporting ATPase activity (synonym cu(2+)-exporting atpase activity) 
8   0.19   GO:0009455   redox taxis 
9   0.19   [illegible]  oxidoreductase activity (synonym redox activity) 
10  0.19   GO:0051776   detection of redox state (synonym redox sensing) 
11  0.17   GO:0002039   p53 binding 
12  0.16   GO:0005507   copper ion binding (synonym copper binding) 
13  0.15   GO:0000393   spliceosomal conformational changes to generate catalytic conformation 

k-NN approach (right panel) 

#   Score  GO ID        Name 
1   1.00   GO:0005507   copper ion binding 
2   0.42   GO:0046688   response to copper ion 
3   0.22   GO:0008270   zinc ion binding 
4   0.21   GO:0003677   DNA binding 
5   0.19   GO:0004784   superoxide dismutase activity 
6   0.16   GO:0006878   cellular copper ion homeostasis 
7   0.13   GO:0035434   copper ion transmembrane transport 
8   0.13   [illegible]  copper ion import 
9   0.13   GO:0071280   cellular response to copper ion 
10  0.12   GO:0005375   copper ion transmembrane transporter activity 
11  [illegible] 
12  0.11   GO:0016531   copper chaperone activity 
13  0.10   GO:0055114   oxidation-reduction process 
14  0.10   GO:0019430   removal of superoxide radicals 
15  0.10   GO:0046914   transition metal ion binding 
16  0.10   GO:0006825   copper ion transport 
17  0.09   GO:0006801   superoxide metabolic process 
18  0.09   GO:0010273   detoxification of copper ion 


Fig. 1 Comparative outputs of lexical vs. k-NN versions of GOCat 

Finally, we see how both categorizers are also flexible instru- 
ments as they basically learn to rank a set of a priori categories. 
Such systems can easily be used as fully automatic systems—thus 
taking into account only the top N returned descriptors by setting 
up an empirical threshold score—or as interactive systems able to 
display dozens of descriptors including many irrelevant ones, which 
then can be discarded by the curator. 
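
The two modes of use described above can be sketched in a few lines. This is a minimal illustration, not GOCat code: the function names, the threshold value, and the scores are assumptions chosen for the example; only the GO identifiers and labels are real.

```python
from typing import List, NamedTuple


class RankedDescriptor(NamedTuple):
    go_id: str
    label: str
    score: float


def automatic_mode(ranked: List[RankedDescriptor],
                   min_score: float,
                   top_n: int) -> List[RankedDescriptor]:
    """Fully automatic use: keep at most top_n descriptors scoring >= min_score."""
    ordered = sorted(ranked, key=lambda d: d.score, reverse=True)
    return [d for d in ordered if d.score >= min_score][:top_n]


def interactive_mode(ranked: List[RankedDescriptor],
                     display_size: int = 50) -> List[RankedDescriptor]:
    """Interactive use: show a long ranked list and let the curator discard."""
    ordered = sorted(ranked, key=lambda d: d.score, reverse=True)
    return ordered[:display_size]


# Illustrative scores only (hypothetical, not actual categorizer output).
candidates = [
    RankedDescriptor("GO:0005507", "copper ion binding", 0.77),
    RankedDescriptor("GO:0003677", "DNA binding", 0.21),
    RankedDescriptor("GO:0004784", "superoxide dismutase activity", 0.19),
    RankedDescriptor("GO:0006825", "copper ion transport", 0.10),
]

accepted = automatic_mode(candidates, min_score=0.20, top_n=3)
shown = interactive_mode(candidates)
print([d.go_id for d in accepted])
print(len(shown))
```

In practice the threshold would be set empirically on a benchmark, whereas the interactive list is deliberately long because the curator makes the final decision.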

Today, GO k-NN categorizers do outperform lexical catego- 
rizers; however, the behavior of the two systems is complementary. 
While the latter is potentially able to assign a GO descriptor, which 
has rarely or never been used to generate an annotation, the former 
is directly dependent on the quantity of [GO; PMID] pairs avail- 
able in GOA. 


3.5 Inter-annotator Agreement


An important element when assessing text mining tools is the
development of a ground truth or gold standard. For GO annotation,
we typically assume that the content of curated databases is the
absolute reference. This assumption is acceptable from a
methodological perspective, as text mining systems need such
benchmarks. However, it is worth observing that two curators do not
fully agree when they assign descriptors, which means that 100%
precision is purely theoretical. Camon et al. [37] report that two
GO annotators reach an agreement score of about 39-43%. The upper
score is obtained when the assignment of a generic concept instead
of a more specific one (a child term) is counted as an agreement.
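
The difference between a strict and a hierarchy-aware agreement score can be illustrated as follows. This is a sketch under stated assumptions: the agreement measure (the share of one annotator's terms matched by the other) is a simple stand-in and not necessarily the metric used in [37], and the small parent table is a toy hierarchy hard-coded for the example rather than an extract of the current ontology.

```python
from typing import Dict, List, Set


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect every ancestor of a term by walking the parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def related(a: str, b: str, parents: Dict[str, List[str]]) -> bool:
    """True if the terms are identical or one is an ancestor of the other."""
    return a == b or a in ancestors(b, parents) or b in ancestors(a, parents)


def agreement(terms_a: Set[str], terms_b: Set[str],
              parents: Dict[str, List[str]], relaxed: bool = False) -> float:
    """Share of annotator A's terms that annotator B also assigned.

    With relaxed=True, a more generic or more specific term counts as a match.
    """
    if not terms_a:
        return 0.0
    if relaxed:
        hits = sum(any(related(a, b, parents) for b in terms_b) for a in terms_a)
    else:
        hits = sum(a in terms_b for a in terms_a)
    return hits / len(terms_a)


# Toy hierarchy (assumption for illustration only).
toy_parents = {
    "copper ion binding": ["transition metal ion binding"],
    "transition metal ion binding": ["metal ion binding"],
    "metal ion binding": ["binding"],
    "DNA binding": ["binding"],
    "p53 binding": ["binding"],
}

curator_a = {"copper ion binding", "DNA binding"}
curator_b = {"metal ion binding", "DNA binding", "p53 binding"}

strict_score = agreement(curator_a, curator_b, toy_parents)
relaxed_score = agreement(curator_a, curator_b, toy_parents, relaxed=True)
print(strict_score, relaxed_score)
```

Here the strict score is lower because "copper ion binding" and "metal ion binding" are treated as a disagreement, whereas the relaxed score accepts the more generic term as a match.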


4 Today's Performances 


Today, GOCat is able to assign a correct descriptor to a given
MEDLINE record two times out of three using the BioCreative I
benchmark [35], which makes it useful to support functional
annotation. Other types of systems can be used to support the
complementary tasks of literature exploration (GoPubMed [38]) or
named-entity recognition [39]. While GOCat attempts to assign GO
descriptors to any input with the objective of helping to curate
the content of the input, GoPubMed provides a set of facets (Gene
Ontology or Medical Subject Headings) to navigate the results of a
query submitted to PubMed.

It is worth observing that GO categorizers work best when they
assume that the curator is involved in selecting the input papers
(performing a triage or selection task as described in Table 1).
Such a setting, inherited from the BioCreative competitions
[33, 40], is questionable for at least two reasons: (1) curators
read full-text articles and not only the abstracts, and captions
and legends seem especially important; (2) the triage task, i.e.,
the ability to select an article as relevant for curation, could
mostly be performed by a machine, provided that fair training data
are available. In 2013, the BioCreative campaign, under the
responsibility of the NCBI, revisited the task [41]. The
competitors were provided with full-text articles and were asked
not only to return GO descriptors but also to select a subset of
sentences. The evaluation was thus more transparent. A small but
high-quality annotated sample of full-text papers was provided
[42].

The main results from these experiments are the following; see
ref. 41 for a complete report describing the competition metrics as
well as the different systems participating in the challenge. First,
the precision of categorization systems improved by about 225%
compared to BioCreative I. Second, the ability to detect all
relevant sentences seems less important than being able to select a
few high content-bearing sentences. Thus GOCat achieved very
competitive results for both recall and precision in the GO
assignment task, but interestingly the system performed relatively
poorly when focusing on the recall of the sentence selection task;
see Figs. 2 and 3 for comparison. We see that two of the sentence
ranking systems developed for the BioCreative IV competition
(orange dots) outperform other systems in precision but not in
recall. References [40, 43] conclude from these experiments that
the content in a full-text article is so highly redundant that a
weak recall is acceptable provided that the few selected sentences
have good precision. The few high-relevance sentences selected by
GOCat4FT (Gene Ontology Categorizer for Full Text) are sufficient
to obtain highly competitive results when GO descriptors are
assigned by GOCat (orange dots) regarding both recall and
precision, as the three official runs submitted by SIB Text Mining
significantly outperform other systems. Such a redundancy
phenomenon is probably found not only in full-text contents but
more generally in the whole literature.

Together with GO and GOA, which were used by most participants in
the competition, some online databases seem particularly valuable
to help assign GO descriptors. Thus, Luu et al. [44] use the
cross-product databases [45] with some effectiveness.


Fig. 2 Relative performance of the sentence triage module of GOCat4FT (GOCat for full-text, blue diamond) at
the official BioCreative IV competition. Courtesy of Zhiyong Lu, National Institutes of Health, National Library of
Medicine

Fig. 3 Relative performance of GOCat4FT (blue diamond) when fed with the sentences selected by the three
sentence triage systems evaluated in Fig. 2


5 Discussion


The results reported in the previous section were obtained using
only 10-20% of the content of an article; a fraction of the content
is thus likely to be sufficient to obtain the top-ranked GO
descriptors. This suggests that 80-90% of what is published is
unnecessary from an information-theoretic perspective.


5.1 Information Redundancy and Curation-Driven Data Stewardship


New and informative statements are rare in general. They are 
moreover buried in a mass of relatively redundant and poorly 
content-bearing claims. It has been shown that the density and 
precision of information in abstracts is higher [5, 46] than in full- 
text reports while the level of redundancy across papers and 
abstracts is probably relatively high as well. 

We understand that separating out valuable scientific statements
is labor intensive for curators. This filtering effort is
complicated within an article but also between articles at
retrieval time. We argue that such a task could be performed by
machines provided that high-quality training data are available.
The training data needed by text mining systems are unfortunately
lost during the curation process. Indeed, the separation between
useful and useless materials (e.g., PMIDs and sentences) is
performed by the curator during the annotation process, but its
outcome is not recorded and therefore not stored in databases.

In some cases the separation is explicit, in other cases it is
implicit, but the key point is that a mass of information is
definitely lost with no possible recovery. The capture of the output
of the selection process (at least for the positive content but
ideally also for a fraction of the negative content) is a minimal
requirement to improve text mining methods. Implementing such a
simple data stewardship recommendation is likely to be a game
changer for text mining, far beyond any hypothetical technological
advances.


5.2 Assigning Unmatchable GO Descriptors: Toward Deep QA


Some GO concepts describe entities which are so specific that they
can hardly be found anywhere. This has several consequences.
Traditional QA systems were recently made popular by answering
Jeopardy-like questions about entities as varied as politicians,
towns, plants, countries, songs, etc.; see ref. 47. In the
biomedical field, Bauer and Berleant [48] compare four systems,
looking at their ergonomics. With a precision in the range of
70-80% [49], these systems perform relatively well. However, none
of these systems is able to answer questions about functional
proteomics. Indeed, how can a text mining system find an answer if
such an answer is not likely to be found on Earth in any corpus of
books, articles, or patents? The ability to accurately process
questions such as “what molecular functions are associated with
tp53?” requires supplying answers such as “RNA polymerase II
transcription regulatory region sequence-specific DNA binding
transcription factor activity involved in positive regulation of
transcription,” and only GO categorizers are likely to
automatically generate such an answer.

We may think that such complex concepts could be made sim- 
pler by splitting the concept into subconcepts, using clinical termi- 
nological resources such as SNOMED CT [50, 51] or ICD-10 
[52], see also Chap. 20 [53]. That might be correct in some rare 
cases but in general, complex systems tend to be more accurately 
described using complex concepts. The post-coordination methods
explored elsewhere remain effective to perform analytical tasks
but they make generative tasks very challenging [52].
Post-coordination is useful to search a database or a digital
library because search tasks treat documents as “bags of words”
and ignore the relationships between these words. However, other
tasks such as QA or curation require the ability to meaningfully
combine concepts. In this context, the availability of a
precomputed list of concepts or controlled vocabulary is extremely
useful to avoid generating ill-formed entities.

Answering functional omics questions is truly original: it
requires the elaboration of a new type of QA engine such as the
DeepQA4GO engine [54]. For GO-type answers, DeepQA4GO is able to
return the expected GO descriptors about two times out of three,
compared to one time out of three for traditional systems. We
propose to call these new emerging systems Deep QA engines. Deep QA
engines, like traditional QA engines, are able to screen through
millions of documents, but since no corpus contains the expected
answers, Deep QA engines need to exploit curated biological
databases in order to generate useful candidate answers for
curators.


6 Conclusion


While the chapter started by introducing the reader to how text
mining can support database annotation, the conclusion is that
next-generation text mining systems will be supported by curated
databases. The key challenges have moved from the design of text
mining systems to the design of text mining systems able to
capitalize on the availability of curated databases. Future
advances in text mining to support biocuration and biomedical
knowledge discovery are largely in the hands of database providers.
Database workflows must start recording explicitly all the data
they curate and ideally also some of the data they do not curate.

In parallel, the accuracy of text mining systems to support GO
annotation has improved massively, from 20 to 65% (+225%), between
2005 and 2015. With almost 10,000 queries a month, a tool like
GOCat is useful in order to provide a basic functional annotation
of proteins with unknown and/or uncurated functions [55], as
exemplified by the large-scale usage of GOCat by the COMBREX
database [56, 57]. However, the integration of text mining support
systems into curation workflows remains challenging. As often
stated, curation is accurate but does not scale while text mining
is not accurate but scales. National and international Research
Infrastructures should play a central role to promote optimal data
stewardship practices across the databases they support. Similarly,
innovative curation models should emerge by combining the quality
and richness of curation workflows, more cost-effective crowd-based
triage, and the scalability of text mining instruments [58].


Chapter 7


How Does the Scientific Community Contribute 
to Gene Ontology? 


Ruth C. Lovering 


Abstract 


Collaborations between the scientific community and members of the Gene Ontology (GO) Consortium 
have led to an increase in the number and specificity of GO terms, as well as increasing the number of GO 
annotations. A variety of approaches have been taken to encourage research scientists to contribute to the 
GO, but the success of these approaches has been variable. This chapter reviews both the successes and 
failures of engaging the scientific community in GO development and annotation, as well as providing
motivation and advice to encourage individual researchers to contribute to GO.


Key words Clinical and basic research, Gene Ontology, Proteomics, Transcriptomics, Community,
Community annotation, Community curation, Genomics, Bioinformatics, Curation, Annotation,
Biocuration


1 Introduction 


The overarching vision of the Gene Ontology Consortium (GOC) 
is to describe gene products across species—their temporally and 
spatially characteristic expression and localization, their contribution 
to multicomponent complexes, and their biochemical, physiologi- 
cal, or structural functions—and thus enable biologists to easily 
explore the universe of genomes [1]. In practical terms, this makes 
providing an accessible, navigable resource of gene products,
rigorously described according to a structured ontology, the GOC’s key
objective. The referenced links, between the identifiers for Gene 
Ontology (GO) terms and the identifiers for specific gene products, 
are the elemental GO annotations. 

With Next Generation Sequencing technologies increasing the 
rate at which genomic and transcriptomic data are accumulating, 
the need for highly informative annotation data for the human 
genome is paramount. Community annotation has the potential to 
improve the information provided by the GO resource. 
Consequently, the GOC actively encourages contributions from
the scientific community, to ensure that the ontology appropriately
reflects the current understanding of biology and to supply gene
product annotations [2-4]. There are many online resources that
encourage community annotation [5-7]; however, annotations 
created in the majority of these are not submitted to the GO data- 
base. This chapter, therefore, only discusses the progress of com- 
munity contributions to the GO database. 


2 Ontology Development Workshops 


The success of GO is dependent on its ability to represent the 
research communities’ interpretation of biological processes and 
individual gene product functions and cellular locations. This is 
achieved through the use of descriptive GO terms, with detailed 
definitions, and appropriate placement of GO terms within the 
ontology hierarchy. The majority of GO terms are created by GO
editors, following a review of the current scientific literature,
often without the need for discussions with experts in the relevant
field [8, 9].

Major revisions or expansions of a specific GO domain are usu- 
ally undertaken in consultation with experts working in that bio- 
logical field. Notable successful ontology development projects 
include that of the immune system [10], heart development [2], 
kidney development [11], muscle processes and cellular compo- 
nents [12], cell cycle, and transcription [13]. The expansion of the 
heart development domain provides a good example of how 
experts in the field can guide the GO editors to create very descrip- 
tive terms. The GO heart development domain describes heart 
morphogenesis, the differentiation of specific cardiac cell types, 
and the involvement of signaling pathways in heart development. 
This was achieved following a 1½-day meeting with four heart
development experts, as well as considerable email exchanges both 
before and after the meeting [2]. The result of this effort was an 
increase in the number of GO terms describing heart development 
from 12 to over 280, and the creation of highly expressive terms 
such as secondary heart field specification (GO:0003139) and 
canonical Wnt signaling in cardiac neural crest cell differentiation 
(GO:0061310). 


3 Community Contributions to the GO Annotation Database 


Lincoln Stein suggested that there are four organizational models
for genome annotation: the factory (reliant on a high degree of
automation), the museum (requiring expert curators), the cottage
industry (scientists working out of their laboratories), and the
party (or jamboree, a short intensive annotation workshop) [14]. To
this list needs to be added “the school,” where people are
encouraged to annotate as part of a bioinformatics training
program.

Currently, there are two major approaches taken to associate GO
terms with gene products: manual curation of the literature and
automated pipelines based on manually created rules (the “factory”)
[15]. The majority of manual annotation follows the “museum” model,
relying on highly trained curators reading the published
literature, evaluating the experimental evidence, and applying the
appropriate GO terms to the gene record [8, 16]. The majority of
these curators are associated with specific model organism
databases, such as FlyBase [17], PomBase [18], and ZFIN [19], or
proteomic databases, such as UniProt [20]. In general, these
curators will be annotating gene products across a whole genome. In
contrast, there have been a few annotation projects funded to
improve the representation of specific biological domains, such as
cardiovascular [3], kidney [21], and neurological [22]. Two of
these projects are being undertaken by the UCL functional
annotation team and provide an example of an expert curation team
embedded within a scientific research group.


3.1 GO Annotation Within a Bioinformatics Course


In the “school” model, bioinformatics courses that include an
introduction to GO provide an opportunity for attendees to
contribute GO annotations. However, providing timely feedback to
degree students is very labor intensive. Texas A&M University has
circumvented this problem through the use of competitive peer
review. A biannual multinational student competition has been
established to undertake large-scale manual annotation of gene
function using GO. In this competition, known as the Community
Assessment of Community Annotation with Ontologies (CACAO), teams
of students get points for making annotations, but can also take
points from competitors by correcting their annotations. A
professional curator then reviews these, and annotations that are
judged to be correct are submitted to the GO database. This highly
successful crowdsourcing project uses the online GONUTS wiki [23]
to submit annotations and has supplied 3700 annotations to the GO
database. The CACAO attribution identifies the resultant
annotations, which are associated with over 2500 proteins. This
competition has given over 700 students the opportunity not only to
learn how to use some of the essential online biological
knowledgebases, but also to reinforce this knowledge over a 3-month
period, connecting their curriculum to research applications. An
MSc literature review project at University College London (UCL)
also provides an opportunity to supply GO annotations to the GO
database. Four projects, to date, have resulted in annotations for
proteins involved in autism [24], heart development, folic acid
metabolism, and hereditary hemochromatosis, creating over 1000
annotations. A limitation of student annotations is that they do
not draw on the expertise of the scientific community.

For the past 5 years, the UCL functional annotation team has run a
2-day introduction to bioinformatics and GO course. This course has
been attended by over 200 scientists, who have been given the
opportunity to use the UniProt GO annotation tool, Protein2GO [20],
to annotate their own papers or those published in their field of
expertise. However, on average only 50 annotations are submitted
during the entire course and very few scientists continue to
contribute annotations after the end of the course. A similar
problem has been identified in many other annotation workshops.


3.2 Annotation Workshops


The first workshop to submit GO annotations to the GO database
focused on the annotation of the Drosophila genome [25]. Following
on from this, the Pathema group ran several annotation-training
workshops in 2007, with the idea that trained scientists would
continue to provide annotation updates thereafter [26].
Unfortunately, this approach had limited success. Although 150
scientists attended, in general they provided guidance to the
curators, rather than creating annotations themselves.


3.3 GO Annotation by Specific Scientific Communities


One of the most successful community annotation projects is that
run by PomBase [18]. During pilot projects, PomBase encouraged 80
scientists from the fission yeast community to submit a variety of
annotations, including 226 GO annotations (see
http://www.pombase.org/community/fission-yeast-community-curation-),
using their curation tool, CANTO [4]. Following on from this
success, the PomBase team now receives regular annotations from the
Schizosaccharomyces pombe community.

Another successful community annotation project has a
transcription focus and was initiated by a group at the Norwegian
University of Science and Technology. To ensure that a consistent
annotation approach is taken, the Norwegian research group, with
members of the GOC, has created a set of transcription factor
annotation guidelines [13]. These provide details of the ideal GO
terms to associate with a transcription factor, with a list of
experimental conditions that would support these annotations. By
using these standardized conventions, the literature-curated data
(currently including annotations for 400 proteins) are imported
directly into the GO database, with only minimal quality checking
required. Working with the GOC, the SYSCILIA consortium may prove
to be just as effective. This group has already contributed to the
development of GO terms to describe ciliary components and
processes and has started to submit GO annotations [27].

The outstanding contributions of Ralf Stephan demonstrate what can
be achieved through dedication. Stephan singlehandedly annotated
60% of the Mycobacterium tuberculosis genome, through the review of
over 1000 papers. Furthermore, the resultant 7700 annotations
associated with 2500 proteins were checked by the UniProt-GOA team
[15] and needed very few edits before incorporation into the GO
database.

The success of PomBase may reflect the small size of the 
research community and that an early visionary investment has had 
a significant impact on the quality of data available at PomBase, 
achieved through the contributions of individual scientists and 
curators. In contrast, the Norwegian transcription factor project
was formed to address the deficit of transcription factor
annotations, in response to a need for comprehensive annotation of
these proteins. The creation of a comprehensive and detailed annotation
guide is key to the achievements of this project [13]. However, the 
GO database would also benefit from a few more “cottage indus- 
try” contributions, such as those provided for the Mycobacterium 
tuberculosis genome. 


4 Why Contribute to GO? 


The motivation behind “community annotation” is varied. Some 
scientists are contributing GO annotations purely to ensure their 
research area or gene product(s) of interest are well curated. Others 
may want to ensure data from their own papers is curated and, 
therefore, promoted in popular knowledgebases; potentially 
increasing the citation rate of these papers. Others still are moti- 
vated by peer competition! Regardless of the motivation, the GOC 
is always appreciative of input from the scientific community. 
Despite the success of some community annotation projects, taken 
as a whole, very few scientists suggest annotations, or papers for 
annotation. Consequently, the GOC continues to search for new 
ways to encourage the research community to contribute to cura- 
tion activities. For example, the inclusion of data from gene wikis 
[5-7] could help take community annotation forwards. Considerable 
funding is being invested in NGS, proteomic and transcriptomic 
technologies and sequencing of population genomes. However, 
comprehensive gene annotation is likely to be a limiting factor in 
the identification of genes involved in polygenic diseases and
disease-associated dysregulated pathways. Many groups are turning to
proprietary resources to provide these annotations [28], which also 
include freely available annotation data. A more sustainable 
approach, and one that will also support genomic research in devel- 
oping countries, is to invest in improving the freely available anno- 
tation resources. All groups working with high-throughput datasets 
should consider working with the GOC and including in grant 
applications a component that would fund the submission of gene 
annotation data describing their area of interest, by expert curators, 
rather than requesting funding to enable access to proprietary
software. The majority of members of the GOC do provide facilities
to enable researchers to contribute to GO; the question is whether
the scientific community will acknowledge that their input is
required.


5 Resources Supporting Expert Contributions to GO


It is unrealistic to expect a limited number of GO curators and edi- 
tors to understand all areas of biological and medical research. 
Consequently, a range of online facilities have been put in place to 
encourage scientists to review the ontology, to comment on the 
annotations, and to suggest papers for curation. In addition,
several GO annotation tools enable scientists to contribute
annotation data [4, 8, 20]. Furthermore, the Protein2GO curation
tool automatically emails authors when one of their papers has been
annotated, giving the authors an opportunity to comment on the
curator’s interpretation of their data [20].

Scientists interested in helping to improve the GO annotation
resource can either contact the group providing annotations to
their species or area of interest (see the GOC contributors webpage,
geneontology.org/page/go-consortium-contributors-list) or submit
enquiries or information through the GOC webform,
geneontology.org/form/contact-go, which will be forwarded to the
relevant database or group. Useful information to provide would be:
details of key experimental publications for curation; a review of
a particular annotation set (associated with a specific gene
product or GO term), pointing out GO annotations that are missing,
wrong, or controversial; and comments on the ontology structure or
definitions of GO terms, with a reference to support the changes
required (Fig. 1). This would ensure that any erroneous annotations
are removed promptly from the GO database, and that information
from seminal papers is included. Scientists who are confident in
using online resources may prefer to submit GO annotations, for any
species, using the PomBase curation tool, CANTO
(curation.pombase.org/pombe) [4]. Information provided by any of
these means will be forwarded to the appropriate curation or
editorial team and contributors will be notified when their
suggestions have been incorporated. Full details about contributing
to GO are available on the GOC website,
http://geneontology.org/page/contributing-go. Professional GO
curators review all submitted annotations to ensure that the
annotations follow GO annotation rules and that a consistent
annotation approach is taken.


6 Following GO Developments 


Scientists interested in finding out more about current GOC
annotation and ontology development projects should sign up to the
go-friends mailing list
(http://mailman.stanford.edu/mailman/listinfo/go-friends).
Alternatively, GO-relevant tweets can be followed via #geneontology
or @news4GO.




[Figure 1: flowchart with two entry questions and a contact panel, summarized below.]

Has your paper been annotated by a GO curator?
Go to the QuickGO browser, www.ebi.ac.uk/QuickGO, and search for your paper’s PubMed identifier (PMID).
If “Nothing found” is listed in the search results, your paper has not been annotated by a GO curator. Contact the GOC to request curation of your paper; include the PMID and the experimental species investigated (so that the appropriate curation group deals with your request).
If your paper is listed in the search results, it has been annotated by a GO curator. Click on the PMID and look at the annotations associated with your paper. The annotations either accurately represent the data or could be improved. In the latter case, email the relevant database to request re-curation of the paper; include the PMID, summarize the information that is missing (or suggest GO annotations), and list annotations that are wrong.

Is your favorite gene/protein well annotated?
Go to a gene/protein database, e.g., NCBI Gene, UniProtKB, GeneCards, Wikipedia, and search for your favorite gene/protein record. Look at the associated GO annotations and follow links to see the full list of annotations.
The gene/protein is either well annotated or not well annotated (missing or wrong annotations). In the latter case, contact the GOC to request curation of your gene/protein; include a list of key publication PMIDs, summarize the information that is missing (or suggest GO annotations), and list annotations that are wrong.

Contact information
Webforms: GOC, geneontology.org/form/contact-go; GOA, www.ebi.ac.uk/GOA/contactus
Email for human gene product annotation: GOA, goa@ebi.ac.uk; UCL, goannotation@ucl.ac.uk
Email list of all GOC contacts: geneontology.org/page/go-consortium-contributors-list


Fig. 1 How research scientists can help to improve the annotation content of GO


Part III


Evaluating Gene Ontology Annotations 


Chapter 8 


Evaluating Computational Gene Ontology Annotations 


Nives Skunca, Richard J. Roberts, and Martin Steffen 


Abstract 


Two avenues to understanding gene function are complementary and often overlapping: experimental 
work and computational prediction. While experimental annotation generally produces high-quality 
annotations, it is low throughput. Conversely, computational annotations have broad coverage, but the 
quality of annotations may be variable, and therefore evaluating the quality of computational annotations 
is a critical concern. 

In this chapter, we provide an overview of strategies to evaluate the quality of computational annotations. 
First, we discuss why evaluating quality in this setting is not trivial. We highlight the various issues that 
threaten to bias the evaluation of computational annotations, most of which stem from the incompleteness 
of biological databases. Second, we discuss solutions that address these issues, for example, targeted selection 
of new experimental annotations and leveraging the existing experimental annotations. 


Key words Gene ontology, Evaluation, Tools, Prediction, Annotation, Function 


1 Introduction 


Sequencing a genome is now routine. However, knowledge of the 
gene sequence is only the first step toward understanding it; we ulti- 
mately want to understand the function(s) of each gene in the cell. 
Function annotation using computational methods—for example, 
function propagation via sequence similarity or orthology—can pro- 
duce high-probability annotations for a majority of gene sequences, 
the next step toward understanding. But because computational 
function annotations often generalize the many layers of biological 
complexity, we are interested in evaluating how well these pre- 
dictions reflect biological reality. In this chapter, we discuss the 
evaluation of computational predictions. 

First, we highlight issues that make the evaluation of computa- 
tional predictions challenging, with perhaps the primary challenge 
being the incompleteness of annotation databases: scoring as 
“wrong” those computational predictions that are not yet proven or 
disproven could overestimate the count of “incorrect” predictions, 
and skew perceptions of computational accuracy [1].

Second, we discuss solutions that address various aspects of
database incompleteness. For example, some solutions directly
address the incompleteness of databases by adding new experimental
annotations. Another solution leverages existing high-quality
annotations in a current release of a database to retrospectively
evaluate previous releases of the annotation databases. Intuitively,
those annotations that are unchanged through multiple successive
database releases may be expected to be of higher quality. Additional
solutions include leveraging negative annotations, which are sparse
but contain valuable information, or performing extensive
experimentation for a subset of functions of interest.


1.1 Sources of Gene Ontology Annotations: Curated and Computational Annotations


In practice, functional annotation of a gene means the assignment 
of a single label, or a set of labels; for example, this might involve 
using BLAST to transfer the labels from another gene. A particu- 
larly valuable set of labels for denoting gene function are those 
derived from the controlled vocabulary established by the Gene 
Ontology (GO) consortium [2], with terms such as “oxygen trans- 
porter activity,” “hemoglobin complex,” and “heme transport,” as 
descriptors of a gene’s Molecular Function, Cellular Component, 
and Biological Process. 

But just as important as the annotation label itself is the knowl- 
edge of the source of the annotation. Based on their source, there 
are two main routes to produce annotations in the GO, and the 
GO Consortium emphasizes this distinction using evidence codes 
[3], as described in Chap. 3 [4]. 
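
The distinction drawn by evidence codes is straightforward to apply in practice. As a minimal sketch, assuming annotations are available as (protein, GO term, evidence code) triples and that IEA is the only code assigned without curator intervention, the two routes can be separated as follows. The function name and the protein identifiers are hypothetical, and the GO identifiers are used only as illustrative labels.

```python
from collections import defaultdict

# Assumption of this sketch: IEA is the only evidence code assigned without
# curator intervention; every other code involves a curator in some way.
ELECTRONIC_CODES = {"IEA"}


def split_by_route(annotations):
    """Group (protein, go_term, evidence_code) triples by annotation route."""
    routes = defaultdict(list)
    for protein, go_term, evidence in annotations:
        route = "computational" if evidence in ELECTRONIC_CODES else "curated"
        routes[route].append((protein, go_term, evidence))
    return dict(routes)


# Hypothetical example records; the protein identifiers are made up.
example = [
    ("PROT_A", "GO:0005344", "IDA"),
    ("PROT_A", "GO:0005833", "IEA"),
    ("PROT_B", "GO:0015886", "ISS"),
]
routes = split_by_route(example)
print({route: len(items) for route, items in routes.items()})
```

Keeping the set of electronic codes in one named constant makes it easy to revisit the assumption if a different grouping of evidence codes is preferred.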

The first route of annotating requires a curator's expertise,
be it examining primary or secondary literature to assign
appropriate annotations, manually examining phylogenetic trees to
infer events of function loss and gain, or deciding on sequence
similarity thresholds for specific gene families to propagate
annotations. As curated annotation is time consuming, curators
streamline their efforts by focusing annotations on the 12 model
organisms ([5] and Fig. 1, left). Consequently, fewer than 1% of
proteins have this type of annotation in the UniProt-GOA
database. Elsewhere, a recent examination of the annotation of 3.3
million bacterial genes found that fewer than 0.4% of annotations
can be documented by experiment, although estimates suggest
that the actual number might be above 1% [6].

The second route of annotating, computational prediction
of function, takes high-quality curated annotations and propagates
them across proteins in nonmodel organisms. Once the pipeline
for the computational prediction has been set up (a task which is
by no means trivial), it can be relatively straightforward to obtain
computational predictions of function across a large number of
biological sequences. Chapter 5 [7] contains a detailed introduction
to the methods used in computational annotation.
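
To make the idea of propagation concrete, the following is a toy sketch of annotation transfer by sequence similarity. It assumes that similarity search results have already been parsed into (query, subject, percent identity) tuples; the function name, the data layout, and the 40% identity cut-off are illustrative assumptions, not a description of any production pipeline.

```python
def transfer_annotations(hits, curated, min_identity=40.0):
    """Propagate GO terms from the best-scoring annotated hit of each query.

    hits: iterable of (query, subject, percent_identity) tuples.
    curated: dict mapping a subject protein to its set of curated GO terms.
    min_identity: hypothetical cut-off; real pipelines tune this carefully.
    """
    best = {}
    for query, subject, identity in hits:
        # Ignore hits without curated annotations or below the cut-off.
        if subject not in curated or identity < min_identity:
            continue
        if query not in best or identity > best[query][1]:
            best[query] = (subject, identity)
    # Record the template protein so that the prediction stays traceable.
    return {
        query: {"terms": set(curated[subject]), "source": subject, "evidence": "IEA"}
        for query, (subject, _) in best.items()
    }


# Hypothetical identifiers and GO labels, for illustration only.
hits = [("q1", "s1", 35.0), ("q1", "s2", 80.0), ("q2", "s1", 90.0), ("q3", "s3", 99.0)]
curated = {"s1": {"GO:1"}, "s2": {"GO:2"}}
predicted = transfer_annotations(hits, curated)
print(sorted(predicted))
```

Queries whose only hits lack curated annotations (here `q3`) receive no prediction, which mirrors the dependence of this route on the availability of curated annotations.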

Computational prediction of function propagates annotations
to the vast majority of currently annotated genes (Fig. 1, right).
Over 99% of all annotations are created in this manner, and they are
applied to approximately 76% of all genes [6]; the remaining 24%
of genes typically have no annotation or are listed as “hypothetical
protein.” With the exponential growth of biological databases and
the labor-intensive nature of manual curation, it is inevitable that
automated computational predictions will provide the vast majority
of annotations populating current and future databases.


[Figure 1, two panels.
Left: 18,840,553 annotations with curator intervention (model organisms 78.5%, non-model organisms 21.5%).
Right: 3,856,684,318 annotations without curator intervention (model organisms 0.6%, non-model organisms 99.4%).]

Fig. 1 The distribution of the number of computational annotations obtained
without curator intervention (evidence code IEA) compared with all other annotations
(evidence codes ISS, IBA, IDA, IMP, ND, IGI, IPI, ISO, TAS, ISA, RCA, IC, NAS, ISM, IEP,
IGC, EXP, IRD, IKR). The 12 model organisms are: Homo sapiens, Mus musculus,
Rattus norvegicus, Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis
thaliana, Gallus gallus, Danio rerio, Dictyostelium discoideum, Saccharomyces
cerevisiae, Schizosaccharomyces pombe, and Escherichia coli K-12


2 Challenges of Assessing Computational Prediction of Function 


Computationally predicted annotations are typically assumed to be 
less reliable than manually curated ones. Manual curation may be 
thought of as more cautious, as there is typically a single protein 
being labeled at a time [8], whereas the goal of computational 
prediction is typically more ambitious: labeling a large number of 
proteins—possibly ignoring subtle aspects of the biological reality. 
Arguably the most accurate method to evaluate computational 
predictions of functions is to perform comprehensive experiments 
(e.g., [9]). However, given the number of computational annota- 
tions available, experimental evaluation is prohibitively expensive 
even for a small subset of the available computational annotations. 


As a consequence of this discrepancy in numbers, two practical
obstacles interfere with the assessment of computational function
prediction: the elusiveness of an unbiased gold standard dataset
and the incompleteness of the recorded knowledge.


2.1 The Elusiveness of an Unbiased Gold Standard Dataset


A major practical obstacle to the evaluation of computational
function prediction methods is the lack of a gold standard dataset,
that is, a dataset that would contain complete annotations for
representative proteins. Such a dataset should not be used to train
the prediction algorithms (refer to Chap. 5 [7]) and can therefore be
used to test them. In the current literature, the validation sets mimic
the gold standard dataset, but they are biased: proteins that are
prioritized for experimental characterization and curation are often
selected for their medical or agricultural relevance, and may not be
representative of the full function space that the computational
methods address. Moreover, with such incomplete validation sets, it is
even more difficult to evaluate algorithms specialized for specific
functions, e.g., those identifying membrane-bound proteins. The gold
standard dataset needs to cover a large breadth of GO terms and
also have comprehensive annotations for these GO terms.

In addition to the difficulties of obtaining a gold standard
dataset, the complexity of the GO graph (see also Chaps. 14 [10]
and 2 [11]), itself a necessary simplification of the true biological
reality, poses obstacles to comparison and evaluation. For example, it
is not trivial to compare the prediction scores between the parent
(more general) and the child (more specific) GO terms: consider
the case when computational methods correctly predict annotations
using parent terms, but give erroneous predictions for the
child terms, i.e., they overpredict. Alternatively, computational
predictions might fail to predict some child GO terms, i.e., they
underpredict. One way of handling such situations is to use the
structure of the GO to probabilistically model protein function, as
described in [12].


2.2 Incomplete Knowledge


Underlying the elusiveness of the unbiased gold standard dataset is
the main issue: the incompleteness of the annotation databases.
When evaluating computational function annotation methods, we
typically compare the predictions with the currently available
knowledge. We confirm the computational annotation when it is
available in our validation set, and we reject it when its negation is
available, e.g., via the NOT qualifier in the GO database. If negative
annotations are sparse, as is often the case, it is standard practice
to consider a prediction wrong when the predicted annotation
is absent from the validation set, e.g., [13]. This is formally called
the Closed World Assumption (CWA), the presumption that a
statement which is true is also known to be true. Conversely,
under the CWA, that which is not currently known to be true is
considered false.

However, the available knowledge, and consequently the validation
set, is incomplete; absence of evidence of function does not
imply evidence of absence of function [14]. This is formally referred
to as the Open World Assumption (OWA), which allows us to
formalize the concept of incomplete knowledge. As a consequence
of the incompleteness of the validation set, we might be rejecting
computational predictions that later prove to be correct [1].
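
The practical difference between the two assumptions can be illustrated with a small sketch. It assumes that predictions, positive annotations, and negative (NOT) annotations are each given as sets of (protein, GO term) pairs and, for simplicity, it ignores the GO hierarchy. The function name and all identifiers are hypothetical.

```python
def score_predictions(predicted, positives, negatives):
    """Compare predictions with known annotations under CWA and OWA.

    All arguments are sets of (protein, go_term) pairs.
    """
    confirmed = predicted & positives
    rejected = predicted & negatives
    # Neither confirmed nor contradicted by current knowledge.
    unknown = predicted - positives - negatives

    # CWA: everything that is not known to be true counts as wrong.
    cwa_precision = len(confirmed) / len(predicted) if predicted else None
    # OWA: only predictions with an explicit verdict are counted.
    decided = len(confirmed) + len(rejected)
    owa_precision = len(confirmed) / decided if decided else None
    return {
        "confirmed": len(confirmed),
        "rejected": len(rejected),
        "unknown": len(unknown),
        "cwa_precision": cwa_precision,
        "owa_precision": owa_precision,
    }


# Hypothetical proteins and GO labels, for illustration only.
predicted = {("P1", "GO:A"), ("P1", "GO:B"), ("P2", "GO:A"), ("P2", "GO:C")}
positives = {("P1", "GO:A"), ("P2", "GO:C")}
negatives = {("P1", "GO:B")}
scores = score_predictions(predicted, positives, negatives)
print(scores)
```

In this toy example one of the four predictions is neither confirmed nor contradicted. Under the CWA it lowers the precision to 0.5, whereas under the OWA it is simply left out of the calculation.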

To illustrate the challenges related to the evaluation of function
prediction, let us focus on one protein, CLC4E_MOUSE
(http://www.uniprot.org/uniprot/Q9R0Q8), and in particular on two
computational annotations assigned to this protein at the time of writing:
the OMA orthology database [15] predicted annotation with “integral
component of membrane” (GO:0016021) and the InterPro
pipeline predicted annotation with “carbohydrate binding”
(GO:0030246). There are no available existing high-quality
annotations that confirm these computational predictions.

However, if we take a closer look at these annotations, the
OMA annotation “integral component of membrane” is consistent
with the experimental annotation (evidence code IDA) “receptor
activity”: in principle, receptors are integral components of
membranes. Additionally, the literature contains evidence that this
protein indeed binds carbohydrates [16], thereby confirming the
InterPro prediction. Therefore, if we revisit the known annotations
and make these statements explicitly known to be true, we can
confirm them.

Indeed, for the proteins already present in the UniProt-GOA
database, we see that curators do revisit them; more than half of
the proteins have already been assigned a new GO term annotation
after their first introduction into the database (Fig. 2). An extreme
example is provided by the Sonic hedgehog entry in mouse
(http://www.uniprot.org/uniprot/B3GAP8), which has already
been revised over a hundred times.


[Figure 2: chart segments labeled “One update” and “More updates”]

Fig. 2 Distribution of proteins based on the number of times a curator revisits a
protein with an annotation from the literature (updates with evidence codes EXP,
IDA, IPI, IMP, IGI, IEP). Among the proteins that have a curated annotation based
on literature evidence, 56% are subsequently updated with a new GO term

To meaningfully compare computational function annotations,
one must account for the Closed World Assumption and keep in mind the
obstacles it implies. But because of the extent of the gap
between the closed and the open world (think of the “unknown
unknowns” in the protein function space), a quick-fix solution does
not exist. However, numerous ways of tackling the problem have been
devised, and we turn our attention to those in the subsequent section.


3 Approaches to Test Computational Predictions with Experimental Data 


To test computational predictions, experiments have to be conducted.
However, the number of proteins that can be experimentally
tested is dwarfed by the number of genes identified by
genome sequencing, so a very small number of experimental data
points must support an enormous number of predicted gene function
annotations.

Among the methods to evaluate computational annotations,
some are focused on quantifying the available information (e.g., the
number and the specificity of annotations) without providing quality
judgment (e.g., [17, 18]), while others, the topic of this section,
strive to evaluate the quality of the predictions themselves. Addressing
some of the complexities of evaluation discussed in the previous
section, the latter methods provide good templates for future
evaluations of computational methods for function prediction.


3.1 The COMBREX Initiative

The need for experimentally verified annotations is of sufficient 
scope that it is likely that significant progress can only be made if 
tackled by the entire scientific community. One such attempt at 
community building is focused on bacterial proteins: COMBREX 
(COMputational BRidge to Experiments), along with additional 
efforts such as the Enzyme Function Initiative [19]. The database 
(http://combrex.bu.edu) classifies the gene function status of 3.3 
million bacterial genes, including 13,665 proteins that have experi- 
mentally determined functions [6]. The database contains traceable 
statements to experimentally characterized proteins, thereby provid- 
ing support for a given annotation in a clear and transparent manner. 
COMBREX also developed a tool, named COMBLAST, to associ- 
ate query genes with the various types of experimental evidence and 
data stored in COMBREX. COMBLAST output includes a trace to 
experimental evidence of function via sequence and domain similar- 
ity, to available structural information for related proteins, and to 
association with clinically relevant phenotypes such as antibiotic 
resistance, and other relevant information. It was used to provide 
additional annotations for 1474 prokaryotic genomes [20]. 


Additionally, COMBREX implemented a proof-of-concept
prioritization scheme that ranked proteins for experimental testing.
For each protein family, distances based on multiple alignments
were calculated to help experimentalists easily identify
those proteins that might be considered most typical of the family
as a whole. The “ideal” COMBREX target is a protein close to
many other uncharacterized proteins, and relatively far from any
protein of known function, but not so far that it would preclude
high-quality predictions of the protein’s function for the
experimentalist to test.

COMBREX helped fund the implementation of new technology
for the experimental characterization of hypothetical proteins
from H. pylori [21]. A panel of affinity probes was used in a screen
to generate initial hypotheses for hypothetical proteins. These
hypotheses were then tested and confirmed using traditional
in vitro biochemistry. This approach is complementary to other
higher throughput methods, such as the parallel screening of
metabolite pools [22, 23], and activity-based proteomic approaches
to identify proteins of a particular enzymatic class [24, 25].


3.2 CAFA and BioCreAtIvE


CAFA (Critical Assessment of Functional Annotation) is another
community-wide effort to evaluate computational annotations,
and it aims to identify some of the most promising algorithms
applied to computational function annotation [13]. Such
an effort has great utility in establishing success rates of many
computational annotation methods based on newly generated
curator knowledge. Chapter 10 [26] covers the details of the
CAFA evaluation.

BioCreAtIvE (Critical Assessment of Information Extraction systems
in Biology) [28], introduced in Chap. 6 [27], is yet another community
effort, with a narrower scope: it is focused on
evaluating annotations obtained through text mining. When evaluating
in this setting, the challenges of evaluation within the open/closed
world do not exist: methods are evaluated based on the
amount of information they can extract from a scientific paper,
which in itself has defined bounds. Evaluating the extraction quality
of GO annotations for a small set of human proteins showed the
extent of the work ahead (text mining algorithms were surpassed
by the Precision of expert curators [29]) but also showed the areas
that need to be addressed to improve the quality of computational
functional annotation using text mining algorithms.


3.3 Evaluating Computational Predictions Over Time Using Successive Database Releases


A strategy to circumvent the problem of the lack of a gold standard
is to consider changes in experimental annotations in the UniProt-GOA
database [30].
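
The release-to-release comparison can be sketched in a few lines, using the reliability and coverage measures of [30] as they are defined in this section. The sketch assumes that each release has been reduced to sets of (protein, GO term) pairs and ignores the GO hierarchy; the function and argument names are hypothetical, and the original study may differ in its details.

```python
def reliability_and_coverage(old_electronic, new_electronic,
                             new_experimental, new_negative):
    """Score an older set of electronic annotations against a newer release.

    old_electronic:   electronic annotations in the older release
    new_electronic:   electronic annotations still present in the newer release
    new_experimental: experimental annotations added since the older release
    new_negative:     'NOT'-qualified annotations added since the older release
    """
    confirmed = old_electronic & new_experimental
    rejected = (old_electronic & new_negative) - confirmed
    # Annotations that disappeared without an explicit verdict count as
    # implicitly rejected.
    removed = old_electronic - new_electronic - confirmed - rejected

    judged = len(confirmed) + len(rejected) + len(removed)
    reliability = len(confirmed) / judged if judged else None
    coverage = (len(confirmed) / len(new_experimental)
                if new_experimental else None)
    return reliability, coverage


# Hypothetical proteins and GO labels, for illustration only.
old_electronic = {("P1", "GO:A"), ("P1", "GO:B"), ("P2", "GO:C"),
                  ("P2", "GO:D"), ("P3", "GO:E")}
new_electronic = {("P1", "GO:A"), ("P2", "GO:D")}
new_experimental = {("P1", "GO:A"), ("P3", "GO:E"), ("P3", "GO:F")}
new_negative = {("P1", "GO:B")}

reliability, coverage = reliability_and_coverage(
    old_electronic, new_electronic, new_experimental, new_negative)
print(reliability, coverage)
```

Electronic annotations that persist unchanged and have no experimental counterpart (here `("P2", "GO:D")`) enter neither the numerator nor the denominator of the reliability measure.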

By keeping track of annotations associated with particular
proteins across successive releases of the UniProt-GOA database,
one can assess the extent to which newly added experimental
annotations agree with previous computational predictions. As a
surrogate for the intuitive notion of specificity, the authors defined a
reliability measure as the ratio of confirmed computational annotations
to confirmed and rejected/removed ones. A computational
annotation is deemed confirmed or rejected, depending on whether
a new, corresponding experimental annotation supports or contradicts
it. Furthermore, if a computational annotation is removed, the
annotation is deemed implicitly rejected and thus contributes
negatively to the reliability measure. As a surrogate for the intuitive
notion of sensitivity, coverage was defined as the proportion of newly
added experimental annotations that had been correctly predicted
by computational annotations in a previous release.

Overall, this work found that electronic annotations are more
reliable than generally believed, to an extent that they are competitive
with annotations inferred by curators when they use evidence
other than experiments from the primary literature. But this work
also reported significant variations among inference methods, types
of annotations, and organisms. For example, the authors noted an
overall high reliability of annotations obtained from mapping
Swiss-Prot keywords associated with UniProtKB entries to GO
terms. Nevertheless, there were exceptions: GO terms related to
metal ion binding had low reliability in the analysis due to a large
number of removed annotations. Similarly, a few annotations
related to ion transport were explicitly rejected with the ‘NOT’
qualifier, e.g., for UniProtID Q6R3K9 (‘NOT’ annotation for
“iron ion transport”) and UniProtID Q9UN42 (‘NOT’ annotation
for “monovalent inorganic cation transport”).


3.4 Increasing the Number of Negative (‘NOT’) Annotations


Having a comprehensive set of negative annotations would bridge
the gap between CWA and OWA; knowing both which functions
a protein does and does not perform, we would no longer reject
predictions that might later prove to be correct.

While experimentally assigning a function to a protein is difficult
and time consuming, it may be equally challenging to establish that
a protein does not perform a particular function. For example,
unsuccessfully testing a protein for a particular function may only
indicate that it is either more difficult to demonstrate such an activity
or that it is not present under the given conditions. Because the
number of conditions and combinations of conditions to test
(e.g., the right partners or the right environmental stimulus) is
large, obtaining a set of ‘NOT’ annotations might be feasible
only for a subset of functions. Consequently, negative annotations
are few and far between in annotation databases. For example,
the January 2015 release of the UniProt-GOA database contains
only 8961 entries that are marked with a ‘NOT’ qualifier.
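
A count of this kind can be reproduced from an annotation file with a short script. The sketch below assumes the tab-delimited GO Annotation File (GAF) layout, in which comment lines start with "!" and the fourth column holds the qualifier, possibly as several values separated by "|". The example lines are fabricated and truncated to the first five columns.

```python
def count_not_annotations(gaf_lines):
    """Count annotation lines whose qualifier column contains 'NOT'."""
    count = 0
    for line in gaf_lines:
        # Skip blank lines and header or comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 4:
            continue
        # The qualifier column may hold several pipe-separated values.
        if "NOT" in columns[3].split("|"):
            count += 1
    return count


# Fabricated example lines (identifiers are made up, columns truncated).
example_lines = [
    "!gaf-version: 2.1",
    "UniProtKB\tP00001\tGENE1\tNOT\tGO:0000001",
    "UniProtKB\tP00002\tGENE2\t\tGO:0000002",
    "UniProtKB\tP00003\tGENE3\tNOT|contributes_to\tGO:0000003",
]
n_not = count_not_annotations(example_lines)
print(n_not)
```

The same function can be applied to an open file handle, since it only iterates over lines.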

There is a small number of reports in the literature stating
that a protein does not perform a specific function (e.g., [31]),
and therefore such sporadic reports cannot be the basis for a
comprehensive evaluation of computational annotations.
Large-scale production of negative annotations does exist; for example,
denoting a set of GO terms that are not likely to be assigned to a
protein, given its known annotations (e.g., [32]). However, because
these are also computational predictions, they too need to be evaluated.


3.5 Evaluating Computational Predictions for a Specific Subset of GO Terms


The BioCreAtIvE challenge sidestepped the
challenges of the open and closed world of function annotations by
focusing on defined “chunks” of information: scientific papers. In
the realm of computational predictions, one of the more
straightforward ways of avoiding the challenges of the closed world is to
limit the scope to functions for which our knowledge is close to
complete. In fact, by narrowing the scope of the function annotation
problem, Huttenhower et al. did just that [9].

The authors evaluated the computational predictions, focusing
the evaluation on functions related to mitochondrial organization
and biogenesis in Saccharomyces cerevisiae. They trained their function
prediction models only on the annotation data available in the
databases, but performed comprehensive experiments for all genes
in S. cerevisiae to check whether they have function related to
mitochondrial organization and biogenesis. This way, they had
information for every S. cerevisiae gene and were able to evaluate
the prediction accuracy without the need for the distinction
between the open and the closed world.


3.6 Simulation Studies


Simulation studies that simulate various evolutionary events are
widely used to evaluate computational methods, as is done, for
example, with the simulation framework for genome evolution
Artificial Life Framework (ALF) [33]. In a related application of
simulation, simulated erroneous annotations were used to study the
quality of computational annotations, specifically curated GO annotations
obtained using methods based on sequence similarity, denoted in the GO
database with the evidence code ISS [34]. First, the authors
estimated the level of errors among the ISS GO annotations by
checking for the effect of randomly adding erroneous annotations.
Second, they obtained a linear model that connected the propensity
of (artificially introduced) errors among the annotations with the
estimate of Precision. Finally, they used this model to estimate the
baseline Precision at the level where there are no introduced errors.


4 Outlook


Experimental annotations are key to evaluating computational
methods to predict annotations. Therefore, it is highly desirable
that three principles govern experimental testing of gene function:
maximal leveraging of existing experimental information, maximal
information gain with each new experiment, and the development
of higher throughput approaches.

Maximal leveraging of existing experimental information is
easiest to obtain through the use of traceable statements, such as
the use of the “with” field in the UniProt-GOA database: the
“with” field can record the protein that was used as template to
transfer annotation through sequence similarity. However, we
could go a step further, toward statements such as: “Gene X has
96.8% sequence identity to the experimentally characterized protein
‘HP0050’ and therefore this protein is annotated as ‘adenine
specific DNA methyltransferase’.” Traceable statements greatly
increase the transparency of a prediction, and allow the users of
gene annotations to estimate their confidence in the annotation,
regardless of the source (a manual curator or an automated
computational prediction) [35].
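
As a minimal sketch of what such a traceable statement could look like as a data structure, the record below stores the template protein and the supporting similarity next to the annotation itself. The class and field names are hypothetical and do not correspond to any existing database schema.

```python
from dataclasses import dataclass


@dataclass
class TraceableAnnotation:
    """An annotation stored together with the evidence it was derived from."""
    gene: str
    go_label: str
    template: str            # experimentally characterized protein used as template
    percent_identity: float  # sequence identity between gene and template

    def statement(self) -> str:
        """Render the record as a human-readable traceable statement."""
        return (
            f"Gene {self.gene} has {self.percent_identity:.1f}% sequence identity "
            f"to the experimentally characterized protein '{self.template}' "
            f"and therefore this protein is annotated as '{self.go_label}'."
        )


record = TraceableAnnotation(
    gene="X",
    go_label="adenine specific DNA methyltransferase",
    template="HP0050",
    percent_identity=96.8,
)
print(record.statement())
```

Because the template and the similarity value remain attached to the annotation, a user can apply a stricter or looser confidence threshold than the one used by the original pipeline.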

In order to increase information gain of new experiments, it 
would be beneficial to develop and incorporate experimental 
design principles that help guide the identification of maximally 
informative targets for function validation. One way to maximize 
the information gain from the experimental analysis is to choose 
proteins that generate or improve predictions for many other pro- 
teins across many genomes, as opposed to proteins related to few 
or no other proteins. Alternatively, for function prediction meth- 
ods that report probabilities, the information gain from an 
experiment can be quantified as the reduction in the estimated 
probability of prediction error, summed across all predictions [36]. 
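
The second criterion can be illustrated with a small calculation. The sketch below assumes that each prediction carries an estimated probability of being correct, both before and after a hypothetical experiment, and sums the resulting reduction in error probability. It is an illustration of the idea only; the function name and the numbers are made up, and the formulation in [36] may differ in its details.

```python
def information_gain(before, after):
    """Sum over predictions of the reduction in estimated error probability.

    before, after: dicts mapping a prediction to its estimated probability of
    being correct; predictions missing from `after` are treated as unchanged.
    """
    gain = 0.0
    for prediction, p_before in before.items():
        p_after = after.get(prediction, p_before)
        gain += (1.0 - p_before) - (1.0 - p_after)
    return gain


# Made-up probabilities for three predictions and two candidate experiments.
before = {"pred1": 0.6, "pred2": 0.7, "pred3": 0.5}
after_experiment_a = {"pred1": 0.9, "pred2": 0.9}  # informs two predictions
after_experiment_b = {"pred3": 0.55}               # informs one prediction

gain_a = information_gain(before, after_experiment_a)
gain_b = information_gain(before, after_experiment_b)
best = "A" if gain_a > gain_b else "B"
print(best)
```

Ranking candidate experiments by this quantity favors targets whose characterization improves many predictions at once.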

Development of higher throughput approaches for the testing 
of protein function is well underway, and we can hope for the same 
effects as with DNA sequencing. However, at the time of writing, 
a small number of experimental studies contribute much of the 
functional protein annotations collected in the databases, thereby 
biasing the available experimental annotations [8]. Indeed, DNA 
sequencing did not achieve its dramatic cost reductions and 
increases in throughput fortuitously, but rather was the result of 
the systematic investment of hundreds of millions of dollars in 
technology development over two decades. 

Traditionally, the increases of success rates associated with 
computational function annotation are attributed to methodologi- 
cal refinements. However, we must also quantify the influence of 
the data available—e.g., more sequences and more function anno- 
tations—independently of the influence of the algorithms. This 
information is critical, if only because of the rate of aggregation of 
new information in the bioinformatics databases. Indeed, an
increase in the number of sequenced genomes and an increase in
the number of function annotations have a dramatic positive effect
on the predictive accuracy of at least one computational method of
function annotation, phylogenetic profiling [37].


5 Conclusion


There is a plethora of highly accurate, readily available
computational function annotation methods available to scientists, and
state-of-the-art computational function annotations, such as in the
UniProt-GOA database, are easily accessible to all. However,
without transparent evaluation and benchmarking, it is still extremely
challenging to differentiate among annotations and annotation
methods.

Going forward, the biocuration community will continue to
advance along three important lines: increased amounts of biological
sequence to be annotated, increased numbers of high-quality
experimental annotations, and increased predictive accuracy of
computational methods of annotation. In order to achieve the
greatest increase in biological knowledge, we will couple the
advances made in each of these three areas to each other, especially
coupling advances in the development of new algorithms
with robust evaluations of these algorithms based on experimental
data, with the purpose of generating new, useful biological
hypotheses. Such work will contribute to closing the gap between the
Open and the Closed worlds, and greatly increase our understanding
of the large number of new sequences that are now generated daily.


Chapter 9


Evaluating Functional Annotations of Enzymes 
Using the Gene Ontology 


Gemma L. Holliday, Rebecca Davidson, Eyal Akiva, and Patricia C. Babbitt 


Abstract 


The Gene Ontology (GO) (Ashburner et al., Nat Genet 25(1):25–29, 2000) is a powerful tool in the
informatics arsenal of methods for evaluating annotations in a protein dataset. Whether the task is
identifying the nearest well-annotated homologue of a protein of interest or predicting where
misannotation has occurred, knowing how confident you can be in the annotations assigned to those
proteins is critical. In this chapter
we explore what makes an enzyme unique and how we can use GO to infer aspects of protein function
based on sequence similarity. These can range from identification of misannotation or other errors in a
predicted function to accurate function prediction for an enzyme of entirely unknown function. Although
GO annotation applies to any gene products, we focus here on describing our approach for hierarchical
classification of enzymes in the Structure-Function Linkage Database (SFLD) (Akiva et al., Nucleic Acids
Res 42(Database issue):D521–530, 2014) as a guide for informed utilisation of annotation transfer based
on GO terms.


Key words Catalytic function, Enzyme, Misannotation, Evidence of function 


1 Introduction 


Enzymes are the biological toolkit that organisms use to perform 
the chemistry of life, and the Gene Ontology (GO) [1] represents 
a detailed vocabulary of annotations that captures many of the 
functional nuances of these proteins. However, the relative lack of 
experimentally validated annotations means that the vast majority 
of functional annotations are electronically transferred, which can 
lead to erroneous assumptions and misannotations. Thus, it is
important to be able to critically examine functional annotations. 
This chapter describes some of the key concepts that are unique for 
applying GO-assisted annotation to enzymes. In particular we 
introduce several techniques to assess their functional annotation 
within the framework of evolutionarily related proteins 
(superfamilies). 


Christophe Dessimoz and Nives Skunca (eds.), The Gene Ontology Handbook, Methods in Molecular Biology, vol. 1446,


1.1 Enzyme Nomenclature and How It Is Used in GO

At its very simplest, an enzyme is a protein that can perform at least
one overall chemical transformation (the function of the enzyme).
The overall chemical transformation is often described by the
Enzyme Commission (EC) Number [2–4] (and see Chap. 19 [5]).
The EC Number takes the form A.B.C.D, where each position in
the code is a number. The first number (which ranges from 1 to 6)
describes the general class of enzyme, the second and third numbers
(which both range from 1 to 99) describe the chemical changes
occurring in more detail (the exact meaning of the numbers
depends on the specific class of enzyme you are looking at) and the
final number (formally ranging from 1 to 999) essentially describes
the substrate specificity. The EC number has many limitations, not
least the fact that it doesn't describe the mechanism (the manner in
which the enzyme performs its overall reaction) and often contains
no information on cofactors, regulators, etc. Nor is it structurally
contextual [6], in that similarity in EC number does not necessarily
imply similarity in sequence or structure, making it sometimes risky
to use for annotation transfer, especially among remotely homologous
proteins. However, it does do exactly what it says on the tin:
it defines the overall chemical transformation. This makes it an
important and powerful tool for many applications that require a
description of enzyme chemistry.
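
The A.B.C.D structure lends itself to simple programmatic handling. The following sketch parses an EC number, checks each position against the ranges quoted above, and counts how many leading levels two EC numbers share. The function names are hypothetical, and the range limits are taken from the description in this section rather than from the official EC list.

```python
# Upper bounds for the four positions, as described in the text above
# (an assumption of this sketch; the official EC list may differ).
EC_LIMITS = (6, 99, 99, 999)


def parse_ec_number(ec):
    """Parse a string such as '2.8.1.6' or 'EC 2.8.1.6' into four integers."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four positions, got {len(parts)}: {ec!r}")
    numbers = []
    for position, (part, limit) in enumerate(zip(parts, EC_LIMITS), start=1):
        if not part.isdigit():
            raise ValueError(f"position {position} is not a number: {ec!r}")
        value = int(part)
        if not 1 <= value <= limit:
            raise ValueError(f"position {position} out of range: {ec!r}")
        numbers.append(value)
    return tuple(numbers)


def shared_levels(ec_a, ec_b):
    """Count how many leading positions two EC numbers have in common."""
    count = 0
    for a, b in zip(parse_ec_number(ec_a), parse_ec_number(ec_b)):
        if a != b:
            break
        count += 1
    return count


print(parse_ec_number("EC 2.8.1.6"))
print(shared_levels("2.8.1.6", "2.1.1.124"))
```

Two enzymes that share only the first level belong to the same general class, which, as noted above, says little about similarity in sequence or structure.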

The Molecular Function Ontology (MFO) in GO contains the 
full definition of around 70% of all currently available EC numbers. 
Theoretically, the MFO would contain all EC numbers available. 
However, due to many EC numbers not currently being assigned to 
a specific protein identifier within UniProtKB, the coverage is lower 
than might be expected. Another important difference between the 
EC hierarchy and the GO hierarchy is that the latter is often much 
more complex than the simple four steps found in the EC hierarchy. 
For example, the biotin synthase (EC 2.8.1.6) hierarchy is relatively
simple and follows the four step nomenclature, while the GO
hierarchy for [cytochrome c]-arginine N-methyltransferase (EC 2.1.1.124)
is much more complex (see Fig. 1).

Formally, MFO terms describe the activities that occur at the 
molecular level; this includes the “catalytic activity” of enzymes
or “binding activity”. It is important to remember that EC numbers
and MFO terms represent activities and not the entities
(molecules, proteins or complexes) that perform them. Further, 
they do not specify where, when or in what context the action 
takes place. This is usually handled by the Cellular Component 
Ontology. The final ontology in GO, the Biological Process 
Ontology (BPO), provides terms to describe a series of events 
that are accomplished by one or more organised assemblies of 
molecular functions. Each MFO term describes a unique single 
function that means the same thing regardless of the evolutionary 
origin of the entity annotated with that term. Although the BPO
describes a collection of activities, some BPO terms can be related
to their counterparts in the MFO, e.g. GO:0009102 (biotin
biosynthetic process) could be considered to be subsumed with the
MFO term GO:0004076 (biotin synthase activity) as GO:0009102
includes the activity GO:0004076, i.e. in such cases, the terms
are interchangeable for the purpose of evaluation of a protein's
annotation. Please see Chap. 2 [7] for a more in-depth discussion
of the differences between BPO and MFO.


Fig. 1 Example of the GO hierarchy (taken from the ancestor chart of the QuickGO website, http://www.ebi.ac.uk/QuickGO/)
showing the relative complexity of the GO hierarchy for two distinct EC numbers. (a) Shows
the GO hierarchy for biotin synthase, EC 2.8.1.6; (b) shows the GO hierarchy for [cytochrome c]-arginine
N-methyltransferase, EC 2.1.1.124. The colours of the arrows in the ontology are denoted by the key in the
centre of the figure. Black connections between terms represent an is_a relationship, blue connections
represent a part_of relationship. The A, B, C and D in red boxes denote the four levels of the EC nomenclature


[Fig. 2 panel labels: Cellular Component; Biological Process; Overall Chemical transformation]

Fig. 2 Hierarchical view of enzyme features. The GO ontologies which describe proteins and their features are
highlighted in light green. Other ontologies available in OBO and BioPortal are shown in the following colours:
light yellow represents the Amino Acid Ontology, purple represents the Enzyme Mechanism Ontology, blue
represents the ChEBI ontology and grey represents the Protein Ontology. See also Chap. 5 [10]. The terms
immediately beneath the parent term are those terms that are covered by ontologies, and required for a protein
to be considered an enzyme

As a protein, an enzyme has many features that can be described 
and used to define the enzyme’s function, from the primary amino 
acid sequence to the enzyme’s quaternary structure (biological 
assembly), the chemistry that is catalysed, to the localisation of the 
enzyme. Features can also denote the presence (or absence) of
active site residues to confirm (or refute) a predicted function, such
as EC class, using the compositional makeup of a protein's amino
acid sequence [8, 9]. Nevertheless, for the many proteins of
unknown function deposited in genome projects, prediction of the 
molecular, biological, and cellular functions remains a daunting 
challenge. Figure 2 provides a view of enzyme-specific features 
along with the GO ontologies that can also be used to describe 
them. Because it captures these features through a systematic and 
hierarchical classification system, GO is heavily used as a standard 
for evaluation of function prediction methods. For example, a regular
competition, the Critical Assessment of Functional Annotation
(CAFA), has brought many in the function prediction community
together to evaluate automated protein function prediction 
algorithms in assigning GO terms to protein sequences [11]. Please 
see Chap. 10 [12] for a more detailed discussion of CAFA. 
1.2 Why Annotate Enzymes with the Gene Ontology?


Although there are many different features and methods that can be
(and are) used to predict the function of a protein, there are several
advantages to using GO as a broadly applied standard. Firstly, GO 
has good coverage of known and predicted functions so that nearly 
all proteins in GO will have at least one associated annotation. 
Secondly, annotations associated with a protein are accompanied 
by an evidence code, along with the information describing that 
evidence source. Within the SFLD [13] each annotation has an 
associated confidence level which is linked to both the evidence 
code, source of the evidence (including the type of experiment) 
and the curator's experience. For example, experimental evidence 
for an annotation is considered as having high confidence whereas 
predictions generated by computational methods are considered of 
lower confidence (Chap. 3 [14]). In general there are three types 
of evidence for the assignment of a GO term to a protein: 


1. Fully manually curated: These proteins will usually have
associated experimental evidence that has been identified by
human curators, who have added the relevant evidence codes.
For the purposes of the SFLD and this chapter, these are con- 
sidered high confidence and will have a greater weight than 
any other annotation confidence level. 


2. Computational with some curator input: These are computa- 
tionally based annotations that have been propagated through 
curator derived rules, and are generally considered to be of 
medium confidence by the SFLD. Due to the huge proportion 
of sequences in large public databases now available, over 98 % 
of GO annotations are inferred computationally [15]. 


3. Computational with no curator input: These are annotations that
have been computationally inferred from information without
any curator input into the inference rules and are considered to
be of the lowest confidence by the SFLD.


All computationally derived annotations rely upon prior knowledge,
and so if the rule is not sufficiently detailed, it can still lead to
the propagation of annotation errors (see Subheading 1.4 on misannotation).
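
The three confidence tiers listed above can be sketched as a simple lookup on GO evidence codes. The grouping of codes into tiers below is an illustrative assumption made for this example; it is not the SFLD's actual scheme, which also takes the evidence source and the curator's experience into account.

```python
# Assumed, simplified grouping of GO evidence codes into the three tiers.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}            # curated, experimental
CURATED_COMPUTATIONAL = {"ISS", "ISO", "ISA", "ISM", "IGC", "IBA", "RCA"}
ELECTRONIC = {"IEA"}                                                  # no curator review

RANK = {"low": 0, "medium": 1, "high": 2}


def confidence(evidence_code):
    """Map a single GO evidence code to an assumed confidence tier."""
    if evidence_code in EXPERIMENTAL:
        return "high"
    if evidence_code in CURATED_COMPUTATIONAL:
        return "medium"
    if evidence_code in ELECTRONIC:
        return "low"
    raise ValueError(f"evidence code not covered by this sketch: {evidence_code}")


def best_confidence(evidence_codes):
    """Return the highest tier among all evidence codes attached to one annotation."""
    return max((confidence(code) for code in evidence_codes), key=RANK.__getitem__)


print(best_confidence(["IEA", "ISS"]))
```

Taking the best tier per protein mirrors the "at least one high-confidence annotation" colouring used later in the chapter; a stricter analysis might instead require agreement between several sources.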

Assigning confidence to annotations is highly subjective [16],
however. For example, one person may consider high-throughput
screening, which is more frequently used to predict protein binding
or subcellular locations than EC numbers, to be of low confidence.
This is because such experiments often have a relatively high number
of false positives that can generate bias in the analysis. However,
depending on what your research questions are, you may consider
such data of high confidence. It all depends on what field
you are in and what your needs are. Generally speaking, the more
reproducible the experiment(s), the higher confidence you can
have in their results. Thus, even low-to-medium-confidence
annotations (from Table 1) may lead to a high-confidence annotation.
For example, the GO Reference Code GO_REF:0000003 provides
automatic GO annotations based on the mapping of EC numbers
to MFO terms, so although annotated as IEA, these annotations
can be considered of higher confidence [18]. Some examples of
high-, medium- and low-confidence annotations are shown in
Table 1, along with reference to the approach used in Swiss-Prot
and the SFLD to describe their reliability.


Table 1
Some example proteins (listed by UniProtKB accession) with their associated annotations, source of
the annotation (the SFLD is the Structure-Function Linkage Database, Swiss-Prot is the curated
portion of UniProtKB) and the confidence of those annotations along with the reason that confidence
level has been assigned

| Protein ID from UniProtKB [17] | Annotated protein function (source) | SFLD confidence level | Types of evidence or reasoning used to annotate the function |
|---|---|---|---|
| Q9X0Z6 | [FeFe]-hydrogenase maturase (from SFLD and Swiss-Prot) | High | Inferred from experimental analysis of protein structures, genomic context and results from spectroscopic assay. |
| Q11S94 | Biotin Synthase (BioB) (from SFLD and Swiss-Prot) | Medium | Inferred from similarity to other BioB enzymes. Matched by similarity to other BioB sequences and catalytic residues are fully conserved. |
| Q58692 | Biotin Synthase (BioB) (from Swiss-Prot) | Low | Inferred from similarity to other BioB enzymes. Matched by similarity to other BioB sequences. Whilst all residues required for binding the iron-sulphur clusters are conserved, all the catalytic residues (those required for the BioB reaction to occur) are not. Also has no biotin synthase genomic context. |


1.3 Annotation Transfer Under the Superfamily Model


We define here an enzyme (or protein) superfamily as the largest 
grouping of enzymes for which a common ancestry can be identi- 
fied. Superfamilies can be defined in many different ways, and every 
resource that utilises them in the bioinformatics community has 
probably used a slightly different interpretation and method to col- 
late their data. However, they can be broadly classified as structure- 
based, in which the three-dimensional structures of all available 
proteins in a superfamily have been aligned and confirmed as homol- 
ogous, or sequence based, where the sequences have been used 
rather than structures. Many resources use a combination of 
approaches. Examples of superfamily based resources include CATH 
[19], Gene3D [20], SCOP and SUPERFAMILY [21], which are 
primarily structure based, and Pfam [22], PANTHER [23] and 
TIGRFAMs [24], which are primarily sequence based. A third
definition of a superfamily includes a mechanistic component, i.e. a
set of sequences must not only be homologous, but there must be some
level of conserved chemical capability within the set, e.g. catalytic
residues, cofactors, substrate and/or product substructures or
mechanistic steps. An example of such a resource is the SFLD and
we will focus on this resource with respect to evaluating GO
annotations for enzymes that are members of a defined superfamily.

The SFLD (http://sfld.rbvi.ucsf.edu/) is a manually curated
classification resource describing structure-function relationships
for functionally diverse enzyme superfamilies [25]. Members of
such superfamilies are diverse in their overall reactions yet share a
common ancestor and some conserved active site features associated
with conserved functional attributes such as a partial reaction
or molecular subgraph that all substrates or products may have in
common. Thus, despite their different functions, members of these
superfamilies often “look alike”, which can make them particularly
prone to misannotation. To address this complexity and enable
reliable transfer of functional features to unknowns only for those
members for which we have sufficient functional information, we
subdivide superfamily members into subgroups using sequence
information (and where available, structural information), and
lastly into families, defined as sets of enzymes known to catalyse the
same reaction using the same mechanistic strategy and catalytic
machinery. At each level of the hierarchy, there are conserved
chemical capabilities, which include one or more of the conserved
key residues that are responsible for the catalysed function; the
small molecule subgraph that all the substrates (or products) may
include and any conserved partial reactions. A subgroup is
essentially created by observing a similarity threshold at which all
members of the subgroup have more in common with one another than
they do with members of another subgroup. (Thresholds derived
from similarity calculations can use many different metrics, such as
simple database search programs like BLAST [26] or Hidden
Markov Models (HMMs) [27] generated as part of the curation
protocol to describe a subgroup or family.)


1.4 Annotation Transfer and Misannotation

Annotation transfer is a hard problem to solve, partly because it is 
not always easy to know exactly how a function should be trans- 
ferred. Oftentimes, function and sequence similarity do not track 
well [28, 29] and so, if sequence similarity is the only criterion that 
has been used for annotation transfer, the inference of function 
may have low confidence. However, it is also very difficult to say 
whether a protein is truly misannotated, especially if no fairly simi- 
lar protein has been experimentally characterised that could be 
used for comparison and evaluation of functional features such as 
the presence of similar functionally important active site residues. 
As we have previously shown [30-32] there is a truly staggering 
amount of protein space that has yet to be explored experimentally 
and that makes it very difficult to make definitive statements as to 
the validity of an annotation. 

Misannotation can come from many sources, from a human
making an error in curation, which is then propagated from the 
top down, to an automated annotation transfer rule that is slightly 
too lax, to the use of transitivity to transfer annotation, e.g. where 
protein A is annotated with function X, protein B is 70% identical 
to A, and so is also assigned function X, protein C is 65 % identical 
to protein B, and so is also assigned function X. Whilst this may be 
the correct function, protein C may have a much lower similarity 
to protein A, and thus the annotation transfer may be “risky” [33]. 
As in the example shown in Fig. 3, sequence similarity networks
(SSNs) [34] offer a powerful way to highlight where potential
misannotation may occur. In this network, all the nodes are
connected via a homologous domain, the Radical SAM domain.
Thus, the observed differences in the rest of the protein mean that
the functions of the proteins may also be quite different. For details
on the creation of SSNs, see Subheading 2.1. Cases where
annotations may be suspect can often be evaluated based on a protein’s
assigned name, and from the GO terms inferred for that protein.


Fig. 3 Example of identifying misannotation using an SSN in the biotin synthase-like
subgroup in the SFLD. Node colours represent different families in the subgroup,
where red represents those sets of sequences annotated as canonical biotin
synthase in the SFLD, blue represents the HydE sequences, green the PylB
sequences and magenta the HmdB sequences. The nodes shown as large diamonds
are those annotated as BioB in GO, clearly showing that the annotation
transfer for BioB is too broad. The network summarizes the similarity relationships
between 5907 sequences. It consists of 2547 representative nodes (nodes represent
proteins that share greater than 90% identity) and 2,133,749 edges, where
an edge is the average similarity of pairwise BLAST E-values between all possible
pairs of the sequences within the connected nodes. In this case, edges are
included if this average is more significant than an E-value of 1e-25. The organic
layout in Cytoscape 3.2.1 is used for graphical depiction. Subheading 2.1
describes how such similarity networks are created
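
The transitivity problem described in this section (A to B to C) can be illustrated with a minimal sketch that checks each recipient of an annotation directly against the experimentally characterised source, rather than against its nearest annotated neighbour. The function names, the 60% threshold and the A-C identity of 40% are hypothetical values chosen for illustration; only the 70% and 65% figures come from the example in the text.

```python
def pairwise_identity(identities, a, b):
    """Look up percent identity for an unordered pair; unknown pairs count as 0."""
    return identities.get((a, b), identities.get((b, a), 0.0))


def risky_transfers(source, recipients, identities, min_identity):
    """Return recipients whose direct identity to the characterised source is too low."""
    return [
        protein
        for protein in recipients
        if pairwise_identity(identities, source, protein) < min_identity
    ]


identities = {
    ("A", "B"): 70.0,  # from the example in the text
    ("B", "C"): 65.0,  # from the example in the text
    ("A", "C"): 40.0,  # hypothetical direct comparison
}
print(risky_transfers("A", ["B", "C"], identities, min_identity=60.0))
```

Sequence identity alone is a crude criterion, as discussed above; the same check could equally be run on HMM scores or on the presence of key residues.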

Not all annotations are created equal, even amongst experi- 
mentally validated annotations, and it is important to consider how 
well evidence supporting an annotation should be trusted. For 
example, in the glutathione transferase (GST) superfamily, the cognate
reaction is often not known as the assays performed use a relatively
standard set of non-physiological substrates to infer the type
of reaction catalysed by each enzyme that is studied. Moreover,
GSTs are often highly promiscuous for two or more different
reactions, again complicating function assignment [32]. That being said,
the availability of even a small amount of experimental evidence can 
help guide future experiments aimed at functional characterisation. 
A new ontology, the Confidence Information Ontology (CIO) [16], 
aims to help annotators assign confidence to evidence. For example, 
evidence that has been reproduced from many different experi- 
ments may have an intrinsically higher confidence than evidence 
that has only been reported once. 


2 Using GO Annotations to Visualise Data in Sequence Similarity Networks 


Sequence similarity networks (SSNs) are a key tool that we use in 
the Structure-Function Linkage Database (SFLD) as they give an
immediately accessible view of the superfamily and the relation- 
ships between proteins in this set. This in turn allows a user to 
identify boundaries at which they might reasonably expect to see 
proteins performing a similar function in a similar manner. As was 
shown in Fig. 3, the GO annotation for BioB covered several dif- 
ferent SFLD families. These annotation terms have been assigned 
through a variety of methods, but mostly inferred from electronic 
annotation (i.e. rule-based annotation transfer as shown in Fig. 4). 

From the networks shown previously, a user may intuitively see 
that there are three basic groups of proteins. Further, it could be 
hypothesised that these groups could have different functions 
(which is indeed the case in this particular example). Thus, the user 
may be left with the question: How do I know what boundaries to 
use for high confidence in the annotation transfer? Figure 5 shows 
another network, this time coloured by the average bit-score for the 
sequences in a node against the SFLD HMM for BioB. This net- 
work exemplifies how (1) sequence similarity (network clusters) 
corresponds with the sequence pattern generated by SFLD curators 
to represent the BioB family, and (2) the HMM true-positive gathering
bit-score cut-off can be fine-tuned. By combining what we know
about the protein set from the GO annotation (Fig. 3) with the
HMM bit-score (Fig. 5) it is possible to be much more confident in
the annotations for the proteins in the red/brown group in Fig. 5.


Fig. 4 Biotin synthase-like subgroup coloured by confidence of evidence (as
shown in Table 1). The diamond shaped nodes are all annotated as Biotin
Synthase in GO. Red nodes are those that only have low confidence annotations,
the orange nodes are those that have at least one medium-confidence annotation
and the green are those that have at least one high-confidence annotation.
Grey nodes have no BioB annotations. Node and edge numbers, as well as
E-value threshold, are as in Fig. 3


2.1 Creating Sequence Similarity Networks

SSNs provide a visually intuitive method for viewing large sets of 
similarities between proteins [34]. Although their generation is 
subject to size limitations for truly large data sets, they can be easily 
created and visualised for several thousand sequences. There are
many ways to create such networks; those created by the
SFLD are generated by Pythoscape [35], freely available software
that can be downloaded, installed and run locally. Recently,
web servers have been described that will generate networks for
users. For example, the Enzyme Similarity Tool (EFI-EST) [36]
created by the Enzyme Function Initiative will take a known set of 
proteins (e.g. Pfam or InterPro [37] groups) and generate networks
for users from that set.


Fig. 5 Example of a sequence similarity network to estimate subgroups for use
in initial steps of the curation process and to guide fine-tuning the hidden Markov
model (HMM) true-positive detection threshold of an enzyme family (here for the
Biotin synthase (BioB) family). Node colours represent the average Bit-score of
the BioB family HMM for all sequences represented by the node. The mapping
between colours and average Bit scores is given in the legend. Nodes with thick
borders represent proteins that belong to the BioB family according to SFLD
annotation. Diamonds represent nodes that include proteins with BioB family
annotation according to GO. The final BioB HMM detection threshold was
achieved for the SFLD by further exploration of stricter E-value thresholds for
edge representation, and was set to 241.6. Node and edge numbers, as well as
E-value threshold, are as in Fig. 3


A similarity network is simply a set of nodes (representing a set of
amino acid sequences as described in this chapter, for example) and
edges (representing the similarity between those nodes). For the
SSNs shown in this chapter, edges
represent similarities scored by pairwise BLAST E-values (used as 
scores) between the source and target sequences. Using simple 
metrics such as these, relatively small networks are trivial and fast to 
produce from a simple all-against-all BLAST calculation. However, 
the number of edges produced depends on the similarity between 
all the nodes to each other, so that for comparisons of a large num- 
ber of closely related sequences, the number of edges will vastly 
exceed the number of nodes, quickly outpacing computational 
resources for generating and viewing networks. As a result, some 
data reduction will eventually be necessary. The SFLD uses repre- 
sentative networks where each node represents a set of highly 
2.2 Determining Over- and Under-represented GO Terms in a Set of Species-Diverse Proteins


similar sequences and the edges between them represent the mean 
E-value similarity between all the sequences in the source node and 
all the sequences in the target node. As shown in Fig. 3, node 
graphical attributes (e.g. shape and colour) used to represent GO 
terms for the proteins shown are a powerful way to recognise rela- 
tionships between sequence and functional similarities. Importantly, 
statistical analyses must be carried out to verify the significance of 
these trends, as we show below. 
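
The representative-network idea described in this subheading can be sketched in a few lines. This is not Pythoscape's implementation: the function names are hypothetical, the mean is taken over -log10(E-value) scores (an assumption about how E-values are "used as scores"), sequence pairs without a BLAST hit are simply skipped, and a floor is applied so that E-values reported as 0.0 can be log-transformed.

```python
import math
from itertools import combinations

E_VALUE_FLOOR = 1e-180  # assumed floor so that E-values of 0.0 can be log-transformed


def score(e_value):
    """Convert an E-value into a score; larger means more significant."""
    return -math.log10(max(e_value, E_VALUE_FLOOR))


def representative_edges(nodes, e_values, e_value_cutoff):
    """Build edges between representative nodes.

    nodes: dict mapping node id -> list of member sequence ids
    e_values: dict mapping frozenset({seq_a, seq_b}) -> pairwise BLAST E-value
    Returns a dict mapping (node_a, node_b) -> mean score, keeping only edges
    whose mean score is at least as significant as the cut-off.
    """
    cutoff = score(e_value_cutoff)
    edges = {}
    for node_a, node_b in combinations(sorted(nodes), 2):
        scores = [
            score(e_values[frozenset((a, b))])
            for a in nodes[node_a]
            for b in nodes[node_b]
            if frozenset((a, b)) in e_values  # pairs without a BLAST hit are skipped
        ]
        if scores:
            mean_score = sum(scores) / len(scores)
            if mean_score >= cutoff:
                edges[(node_a, node_b)] = mean_score
    return edges


# Toy data: three representative nodes and three pairwise hits (all hypothetical).
nodes = {"n1": ["s1", "s2"], "n2": ["s3"], "n3": ["s4"]}
e_values = {
    frozenset(("s1", "s3")): 1e-30,
    frozenset(("s2", "s3")): 1e-40,
    frozenset(("s1", "s4")): 1e-5,
}
print(representative_edges(nodes, e_values, e_value_cutoff=1e-25))
```

The resulting edge dictionary can be written out as a simple edge list for viewing in a network tool; tightening the cut-off removes weaker edges and so changes how the nodes cluster.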


A common use of GO enrichment analysis is to evaluate sets of 
differentially expressed genes that are up- or down-regulated 
under certain conditions [38]. The resulting analysis identifies 
which GO terms are over- or under-represented within the set in 
question. With respect to enzyme superfamilies, the traditional 
implementation of enrichment analysis will not work well as there 
are often very many different species from different kingdoms in 
the dataset. However, there are several ways that we can still utilise 
sets of annotated proteins to evaluate the level of enrichment for 
GO terms. 

The simplest and least rigorous method is to take the set of
proteins being evaluated, count up the number of times a single 
annotation occurs (including duplicate occurrences for a single 
enzyme, as these have different evidence sources) and up-weight 
for experimental (or high confidence) annotations. Then, by divid- 
ing by the number of proteins in the set, any annotation with a 
ratio greater than one can be considered “significant”. 
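
A minimal sketch of this simple counting scheme is given below. The function name and the weight of 2.0 for experimental annotations are assumptions made for illustration; the text does not prescribe a particular weight.

```python
from collections import defaultdict


def weighted_term_ratios(annotations, n_proteins, experimental_weight=2.0):
    """Weighted annotation count per GO term, divided by the number of proteins.

    annotations: iterable of (protein_id, go_term, is_experimental) tuples;
    duplicate (protein, term) pairs from different evidence sources all count.
    """
    totals = defaultdict(float)
    for _protein, term, is_experimental in annotations:
        totals[term] += experimental_weight if is_experimental else 1.0
    return {term: total / n_proteins for term, total in totals.items()}


annotations = [
    ("P1", "GO:0004076", True),   # experimental evidence, up-weighted
    ("P1", "GO:0004076", False),  # same protein and term, second evidence source
    ("P2", "GO:0004076", False),
    ("P2", "GO:0005506", False),  # a second term, included only for contrast
]
ratios = weighted_term_ratios(annotations, n_proteins=2)
print({term: ratio for term, ratio in ratios.items() if ratio > 1.0})
```

Because duplicates and weights inflate the counts, a ratio above one is only a rough flag and carries no statistical meaning, which is why the more rigorous treatment that follows is usually preferable.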

A more rigorous treatment assumes that for a set of closely 
related proteins (i.e. belonging to a family) a specific GO term is 
said to be over-represented when the number of proteins assigned 
to that term within the family of interest is enriched versus the 
background model as determined by a probability distribution. 
Thus, there are two decisions that need to be made, firstly, identi- 
fying the background model and then which probability function 
to use. The background model is dependent on the dataset and 
the question that is being asked. For example in the SFLD model, 
we might use the subgroup or superfamily and a random back- 
ground model that gives us an idea of what annotations could 
occur purely by chance. The lack of high (and sometimes also 
medium) confidence annotations is another complication in examining
enrichment of terms. If one is using IEA annotations to infer
function, the assertions can quickly become circular (with inferred 
annotations being transferred to other proteins which in turn are 
used to annotate yet more proteins), leading to results which 
themselves are of low confidence. Similarly, if very few proteins are 
explicitly annotated with a high/medium confidence annotation, 
the measure of significance can be skewed due to low counts in 
the dataset. The choice of the probability function is also going to 
depend somewhat on what question is being asked, but the 
2.3 Using Semantic Significance with GO


2.4 Use of Orthogonal Information to Evaluate GO Annotation


hypergeometric test (used for a finite universe) is common in GO 
analyses [39, 40]. For more detail on enrichment analysis, see 
Chap. 13 [41]. 
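
As a hedged illustration of the hypergeometric test mentioned above, the sketch below computes one-sided p-values from first principles using only the Python standard library. The function names and the mapping of symbols are assumptions for this example: N is the size of the background (e.g. the subgroup or superfamily), K the number of background proteins annotated with the term, n the size of the family of interest and k the number of family members annotated with the term.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated proteins in the family."""
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def hypergeom_depletion_p(k, n, K, N):
    """P(X <= k): chance of seeing at most k annotated proteins in the family."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(0, k + 1)) / total


# Toy numbers: background of 10 proteins, 4 carry the term,
# family of 3 proteins, 2 of which carry the term.
print(hypergeom_enrichment_p(k=2, n=3, K=4, N=10))
```

With the very small counts typical of high-confidence annotations, such p-values are unstable, which is the skew discussed above; when many terms are tested, a multiple-testing correction would also be needed.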


Instead of simply transferring annotations utilising sequence 
homology and BLAST scores, many tools are now available (e.g. 
Argot2 [42] and GraSM [43]) that utilise semantic similarity
[42-46]. Here, the idea is that in controlled vocabularies, the degree of
relatedness between two entities can be assessed by comparing the 
semantic relationship (meanings) between their annotations. The 
semantic similarity measure is returned as a numerical value that 
quantifies the relationship between two GO terms, or two sets of 
terms annotating two proteins. 

GO is well suited to such an approach; for example, many child
terms in the GO directed acyclic graph (DAG) have a similar
vocabulary to their parents. The nature of the GO DAG means that
a protein with a function A will also inherit the more generic
functions that appear higher up in the DAG; this can be one or more
functions, depending on the DAG. For example, ion transmembrane
transporter activity (GO:0015075) is a term similar to
voltage-gated ion channel activity (GO:0005244), the latter of which is
a descendant of the former, albeit separated by the ion channel
activity (GO:0005216) term. Thus, the ancestry and semantic
similarity lend greater weight to the confidence in the annotation.
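
The inheritance of more generic terms can be sketched on a toy fragment of the DAG containing only the three terms named above. The similarity function is a plain Jaccard index over ancestor sets, used here as a simple stand-in; it is not the measure implemented by Argot2 or GraSM, and the PARENTS dictionary is a hand-made assumption rather than a parsed copy of GO.

```python
# Toy is_a fragment: child term -> set of direct parent terms.
PARENTS = {
    "GO:0005244": {"GO:0005216"},  # voltage-gated ion channel activity
    "GO:0005216": {"GO:0015075"},  # ion channel activity
    "GO:0015075": set(),           # ion transmembrane transporter activity
}


def ancestors(term, parents):
    """Collect every term reachable by following parent links upwards."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def jaccard_similarity(term_a, term_b, parents):
    """Shared fraction of the two terms' ancestor sets (each term included)."""
    set_a = ancestors(term_a, parents) | {term_a}
    set_b = ancestors(term_b, parents) | {term_b}
    return len(set_a & set_b) / len(set_a | set_b)


print(sorted(ancestors("GO:0005244", PARENTS)))
print(jaccard_similarity("GO:0005244", "GO:0015075", PARENTS))
```

In a real analysis the parent links would be read from the ontology file, and an information-content-based measure would usually be preferred over a raw set overlap.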

Such similarity measures can be used instead of (or in conjunc- 
tion with) sequence similarity measures. Indeed, it has been shown 
[47] that there is good correlation between the protein sequence 
similarity and the GO annotation semantic similarity for proteins in 
Swiss-Prot, the reviewed section of UniProtKB [17]. Consistent 
results, however, are often a feature not only of the branch of GO 
to which the annotations belong, but also the number of high con- 
fidence annotations that are being used. For a more detailed and 
comprehensive discussion of the various methods, see Pesquita 
et al. [44] and Chap. 12 [48]. 


In the example shown in Fig. 3, it is clear that many more nodes in 
the subgroup are annotated as biotin synthase by GO than match 
the stringent criteria set within the SFLD, which require not only a
significant E-value (or Bit Score) to transfer annotation, but also the
presence of the conserved key residues. As mentioned earlier, one
key advantage to using GO annotations over those of some other 
resources is the evidence code (and associated source of that evi- 
dence) as shown in Fig. 4. As indicated by that network, when using 
GO annotations, it is important to also consider the associated con- 
fidence level for the evidence used in assigning an annotation (see 
Table 1). In Fig. 4, only a few annotations are supported by high- 
confidence evidence. Alternatively, if a protein has a high confidence 
experimental evidence code for membership in a family of interest 
yet is not included by annotators in that family, then the definition
of that family may be too strict, indicating that a more permissive 
gathering threshold for assignment to the family should be used. 
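
The two-part criterion described above (a sufficiently high HMM score and the presence of the conserved key residues) can be sketched as a single check. The function name, the alignment positions and the residue sets are hypothetical; only the bit-score threshold of 241.6 is taken from the BioB example in Fig. 5.

```python
BIOB_BIT_SCORE_THRESHOLD = 241.6  # BioB HMM detection threshold quoted in Fig. 5


def can_transfer_annotation(bit_score, threshold, observed_residues, key_residues):
    """Allow transfer only if the score passes and every key residue is conserved.

    observed_residues: dict mapping alignment position -> residue in the query
    key_residues: dict mapping alignment position -> set of acceptable residues
    """
    if bit_score < threshold:
        return False
    return all(
        observed_residues.get(position) in allowed
        for position, allowed in key_residues.items()
    )


# Hypothetical key positions and residues, for illustration only.
key_residues = {10: {"C"}, 14: {"C"}, 17: {"C"}}
conserved_query = {10: "C", 14: "C", 17: "C"}
divergent_query = {10: "C", 14: "C", 17: "A"}

print(can_transfer_annotation(300.0, BIOB_BIT_SCORE_THRESHOLD, conserved_query, key_residues))
print(can_transfer_annotation(300.0, BIOB_BIT_SCORE_THRESHOLD, divergent_query, key_residues))
```

A query that passes the score threshold but fails the residue check corresponds to the low-confidence case in Table 1, and is better assigned to the subgroup than to the family.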

Another way of assessing the veracity of the annotation trans- 
ferred to a query protein is to examine both the annotations of the 
proteins that are closest to it in similarity as well as other entirely 
different types of information. 

One example of such orthogonal information is the genomic 
context of the protein. It can be hypothesised that if a protein 
occurs in a pathway, then the other proteins involved in that path- 
way may be co-located within the genome [49]. This association is 
frequently found in prokaryotes, and to a lesser extent in plants 
and fungi. Genomic proximity of pathway components is infre- 
quent in metazoans, thus genomic context as a means to function 
prediction is more useful for bacterial enzymes. Additionally, other 
genes in the same genomic neighbourhood may be relevant to 
understanding the function of both the protein of interest and of 
the associated pathway. A common genomic context for a query 
protein and a homologue provides further support for assignment 
of that function. (However, the genomic distance between path- 
way components in different organisms may vary for many reasons, 
thus the lack of similar genomic context does not suggest that the 
functions of a query and a similar homologue are different.) 

Another type of orthogonal information can
be deduced from the protein domains present in a query protein and
their associated annotations: what are the predicted domains
present in the protein, and do they all match the assigned function or
are there anomalies? A good service for identifying such domains is
InterProScan [50]. Further, any protein in UniProtKB will have the 
predicted InterPro identifiers annotated in the record (along with 
other predicted annotations from resources such as Pfam and 
CATH), along with the evidence supporting those predictions. 
Such sequence context can also be obtained using hidden Markov 
models (HMMs) [51], which is the technique used by InterPro, 
Pfam, Gene3D, SUPERFAMILY and the SFLD to place new 
sequences into families, subgroups (SFLD-specific term) and super- 
families (see Fig. 6). 


3 Challenges and Caveats 


3.1 The Use of Sequence Similarity Network


A significant challenge with using SSNs to help evaluate GO 
annotations is that SSNs are not always trivial to use without a 
detailed knowledge of the superfamilies that they describe. For 
example, choosing an appropriate threshold for drawing edges is 
critical to obtaining network clustering patterns useful for deeper 
evaluation. In Fig. 3, HydE (the blue nodes) are not currently 
annotated as such in GO, but are annotated instead as BioB. Thus, 
the evaluation of the network becomes significantly more complex.
It is also not always clear what signal is being picked up in the edge
data for large networks. It is usually assumed that all the proteins
in the set share a single domain, but this is often only clear when
the network is examined in greater detail.


[Fig. 6 panel text: family HMM E-value tables for Biotin Synthase, HmdB and HydE; InterPro Family: Biotin synthase (IPR024177); InterPro Family: [FeFe]-hydrogenase maturation HydE, radical SAM (IPR024021)]

Fig. 6 Biotin synthase-like subgroup SSN in which the biotin synthase GO annotations are shown as
large diamonds. Two proteins, one from the BioB set (red nodes, top right) and one from the HydE set (blue
nodes, bottom left), are shown with some associated orthogonal information: genomic context highlighted in
light cyan boxes, their HMM match results for the query protein against the three top scoring families in the
subgroup are shown in the tables, and family membership (according to InterProScan) shown in coloured text
(blue for HydE and red for BioB). Node and edge numbers, as well as E-value threshold, are as in Fig. 3. All the
proteins are connected via a homologous domain (the Radical SAM domain). Thus, the observed differences in
the rest of the protein mean that the functions of the proteins may also be quite different


3.2 Annotation Transfer Is Challenging Because Evolution Is Complex


Even using the powerful tools and classifications provided by GO, 
interpreting protein function in many cases requires more in-depth 
analysis. For several reasons, it is not always easy to confidently 
determine that a protein is not correctly annotated. Firstly, how 
closely related is the enzyme to the group of interest? Perhaps we 
can only be relatively certain of its superfamily membership, or 
maybe we can assign it to a more detailed level of the functional 
hierarchy. If it fits into a more detailed classification level, how well 
does it fit? At what threshold do we begin to see false positives 
creeping into the results list? Using networks, we can also examine 
the closest neighbours that have differing function and ask whether 
there are similarities in the function (e.g. Broderick et al. [52] used 
sequence similarity networks to help determine the function of 
HydE). Another complicating issue is whether a protein performs 
one or more promiscuous functions, albeit with a lesser efficacy. 

Another important piece of evidence that can be used to sup- 
port an annotation is conservation of the key residues, so it is 
important to assess if the protein of interest has all the relevant 
functional residues. Although GO includes an evidence code to 
handle this concept (Inferred from Key Residues, IKR), it is often 
not included in the electronic inference of annotations. It is impor- 
tant to note, however, that there are evolutionary events that may 
“scramble” the sequence, leaving it unclear to an initial examina- 
tion whether the residues are conserved or not. A prime example is 
the case in which a circular permutation has occurred. Thus, it is 
important to look at whether there are other residues (or patterns 
of residues) that could perform the function of the “missing” resi- 
dues. It is also possible that conservative mutations have occurred, 
and these may also have the ability to perform the function of the 
“missing” residues [53]. 

Another consideration with function evaluation is the occurrence 
of moonlighting proteins. These are proteins that are identical in 
terms of sequence but perform different functions in different cellu- 
lar locations or species; for example argininosuccinate lyase 
(UniProtKB id P24058) is also a delta crystallin, which serves as an 
eye lens protein when it is found in birds and reptiles [54]. A good 
source of information on moonlighting proteins is MoonProt 
(http://www.moonlightingproteins.org/) [55]. Such cases may 
arise from physiological use in many different conditions such as dif- 
ferent subcellular localisations or regulatory pathways. The full extent 
of proteins that moonlight is currently not known, although to date, 
almost 300 cases have been reported in MoonProt. Another compli- 
cating factor for understanding the evolution of enzyme function is 
the apparent evolution of the same reaction specificity from different 
intermediate nodes in the phylogenetic tree for the superfamily, for 
example the N-succinyl amino acid racemase and the muconate 
lactonising enzyme families in the enolase superfamily [56, 57]. 

Finally, does the protein have a multi-domain architecture and/or 
is it part of a non-covalent protein-protein interaction in the cell? 
An example of a functional protein requiring multiple chains that 
are transiently coordinated in the cell is pyruvate dehydrogenase 
(acetyl-transferring) (EC 1.2.4.1). This protein has an active site at 
the interface between pyruvate dehydrogenase E1 component 
subunit alpha (UniProtKB identifier P21873) and beta (UniProtKB 
identifier P21874), both of which are required for activity. 
Thus, transfer of annotation relating to this function to an unknown 
protein (and hence evaluation of misannotation) needs to include both 
proteins. Similarly, a single chain with multiple domains, e.g. biotin 
biosynthesis bifunctional protein BioAB (UniProtKB identifier 
P53656), which contains a BioA and a BioB domain, has two different 
functions associated with it. In this example, these two functions 
are distinct from one another so that annotation of this protein 
with only one function or the other could represent a type of 
misannotation (especially as a GO term is assigned to a protein, not a 
specific segment of its amino acid sequence). 


3.3 Plurality Vote May Not Be the Best Route


In some cases, proteins are annotated by some type of “plurality 
voting”. Plurality voting simply assumes that the more predictors 
that agree on an annotation, the more likely that annotation is 
to be correct. As we have shown in this chapter (and others before 
us [58]), this is not always the case. An especially good example of 
where plurality voting fails is the case of the lysozyme mechanism. 
For over 50 years, the mechanism was assumed to be dissociative, 
but a single experiment provided evidence of a covalent intermediate 
being formed in the crystal structure, calling into question the 
dissociative mechanism. If plurality voting were applied in ongoing 
annotations, the old mechanism would still be considered correct. 
That being said, it is more difficult to identify problems of this type 
if experimental evidence challenging an annotation is unavailable. 
In such cases, we must always look at all the available evidence to 
transfer function, and where there are disagreements between 
predicted functions, a more detailed examination is needed. Only 
when we have resolved such issues can we have any true confidence 
in the plurality vote. Work by Kristensen et al. [59] provides a 
good example of the value of this approach. By using three-dimensional 
templates generated using knowledge of the evolutionarily 
important residues, they showed that they could identify a single 
most likely function in 61% of 3D structures from the Structural 
Genomics Initiative, and in those cases the correct function was 
identified with 87% accuracy. 


4 Conclusions


Experimentalists simply can't keep up with the huge volume of data 
that is being produced in today's high-throughput labs, from whole 
genome and population sequencing efforts to large-scale assays and 
structure generation. Almost all proteins will have at least one asso- 
ciated GO annotation, and such coverage makes GO an incredibly 
powerful tool, especially as it has the ability to handle all the known 
function information at different levels of biological granularity, has 
explicit tools to capture high-throughput experimental data and 
utilises an ontology to store the annotation and associated 
relationships. Although over 98% of all GO annotations are 
computationally inferred, with the ever-increasing state of knowledge, these 
annotation transfers are becoming more confident [15] as 
rule-based annotations gain in specificity due to more data being 
available. However, there is still a long way to go before we can 
simply take an IEA annotation at face value. Confidence in annota- 
tions transferred electronically has to be taken into account: How 
many different sources have come to the same conclusion (using 
different methods)? How many different proteins’ functions have 
been determined in a single experiment? Similarly, whilst burden of 
evidence is a useful gauge in determining the significance of an 
annotation, there is also the question of when substantially different 
annotations were captured in GO and other resources—perhaps 
there has been a new experiment that calls into question the origi- 
nal annotation. It is also important to look at whether other, similar 
proteins were annotated long ago or are based on new experimental 
evidence. There is a wealth of data available that relates to enzymes 
and their functions. This ranges from the highest level of associating 
a protein with a superfamily (and thus giving some information as 
to the amino acid residues that are evolutionarily conserved), to the 
most detailed level of molecular function. We can use all of these 
data to aid us in evaluating the GO annotations for a given protein 
(or set of proteins), from the electronically inferred annotation for 
protein domain structure, to the genomic context and protein fea- 
tures (such as conserved residues). The more data that are available 
to back up (or refute) a given GO annotation, the more confident 
one can be in it (or not, as the case may be). 
Chapter 10 


Community-Wide Evaluation of Computational Function 
Prediction 


Iddo Friedberg and Predrag Radivojac 


Abstract 


A biological experiment is the most reliable way of assigning function to a protein. However, in the era of 
high-throughput sequencing, scientists are unable to carry out experiments to determine the function of 
every single gene product. Therefore, to gain insights into the activity of these molecules and guide experi- 
ments, we must rely on computational means to functionally annotate the majority of sequence data. To 
understand how well these algorithms perform, we have established a challenge involving a broad scientific 
community in which we evaluate different annotation methods according to their ability to predict the 
associations between previously unannotated protein sequences and Gene Ontology terms. Here we dis- 
cuss the rationale, benefits, and issues associated with evaluating computational methods in an ongoing 
community-wide challenge. 


Key words Function prediction, Algorithms, Evaluation, Machine learning 


1 Introduction 


Molecular biology has become a high volume information science. 
This rapid transformation has taken place over the past two decades 
and has been chiefly enabled by two technological advances: (1) 
affordable and accessible high-throughput sequencing platforms, 
sequence diagnostic platforms, and proteomic platforms and (2) 
affordable and accessible computing platforms for managing and 
analyzing these data. It is estimated that sequence data accumulates 
at the rate of 100 exabases per day (1 exabase = 10^18 bases) [35]. 
However, the available sequence data are of limited use without 
understanding their biological implications. Therefore, the develop- 
ment of computational methods that provide clues about functional 
roles of biological macromolecules is of primary importance. 

Many function prediction methods have been developed over 
the past two decades [12, 31]. Some are based on sequence align- 
ments to proteins for which the function has been experimentally 
established [4, 11, 24], yet others exploit other types of data such as 
protein structure [26, 27], protein and gene expression data [17], 
macromolecular interactions [21, 25], scientific literature [3], or a 
combination of several data types [9, 34, 36]. Typically, each new 
method is trained and evaluated on different data. Therefore, estab- 
lishing best practices in method development and evaluating the 
accuracy of these methods in a standardized and unbiased setting is 
important. To help choose an appropriate method for a particular 
task, scientists often form community challenges for evaluating 
methods [7]. The scope of these challenges extends beyond testing 
methods: they have been successful in invigorating their respective 
fields of research by building communities and producing new ideas 
and collaborations (e.g., [20]). 

In this chapter we discuss a community-wide effort whose goal 
is to help understand the state of affairs in computational protein 
function prediction and drive the field forward. We are holding a 
series of challenges which we named the Critical Assessment of 
Functional Annotation, or CAFA. CAFA was first held in 2010-2011 
(CAFA1) and included 23 groups from 14 countries who 
entered 54 computational function prediction methods that were 
assessed for their accuracy. To the best of our knowledge, this was 
the first large-scale effort to provide insights into the strengths and 
weaknesses of protein function prediction software in the bioinfor- 
matics community. CAFA2 was held in 2013-2014, and more 
than doubled the number of groups (56) and participating methods 
(126). Although several repetitions of the CAFA challenge would 
likely give a more accurate picture of the trajectory of the field, 
valuable lessons have already been learned from the two CAFA efforts. 

The results of CAFA1 were reported in full in [30]. At the time of 
writing, the results of CAFA2 are still unpublished and will be 
reported in the near future; a preprint of the paper is available on 
arXiv [19]. 


2 Organization of the CAFA Challenge 


We begin our explanation of CAFA by describing the participants. 
The CAFA challenge generally involves the following groups: the 
organizers, the assessors, the biocurators, the steering committee, 
and the predictors (Fig. 1a). 

The main role of the organizers is to run CAFA smoothly and 
efficiently. They advertise the challenge to recruit predictors, coordi- 
nate activities with the assessors, report to the steering committee, 
establish the set of challenges and types of evaluation, and run the 
CAFA web site and social networks. The organizers also compile 
CAFA data and coordinate the publication process. The assessors 
develop assessment rules, write and maintain assessment software, 
collect the submitted prediction data, assess the data, and present 
the evaluations to the community. The assessors work together with 
the organizers and the steering committee on standardizing 
submission formats and developing assessment rules. The biocurators 
joined the experiment during CAFA2: they provide additional 
functional annotations that may be particularly interesting for the 
challenge. The steering committee members are in regular contact 
with the organizers and assessors. They provide advice and guidance 
that ensures the quality and integrity of the experiment. Finally, the 
largest group, the predictors, consists of research groups who 
develop methods for protein function prediction and submit their 
predictions for evaluation. The organizers, assessors, and biocurators 
are not allowed to officially evaluate their own methods in CAFA. 


[Fig. 1 panels: (a) CAFA organization. Biocurators: provide functional annotations (CAFA2). Organizers: direct the experiment, connect all parties. Steering Committee: oversee the experiment, ensure integrity. Assessors: collect predictions, evaluate methods. Predictors: develop methodology, submit predictions. 
(b) Experiment timeline with time points t0, t1, t2, t3 and phases Prediction, Annotation growth, Assessment.] 

Fig. 1 The organizational structure of the CAFA experiment. (a) Five groups of participants in the experiment 
together with their main roles. Organizers, assessors, and biocurators cannot participate as predictors. 
(b) Timeline of the experiment 

CAFA is run as a timed challenge (Fig. 1b). At time t0 a large 
number of experimentally unannotated proteins are made public by 
the organizers and the predictors are given several months, until time 
t1, to upload their predictions to the CAFA server. At time t1 the 
experiment enters a waiting period of at least several months, during 
which the experimental annotations are allowed to accumulate in 
databases such as Swiss-Prot [2] and UniProt-GOA [16]. These 
newly accumulated annotations are collected at time t2 and are 
expected to provide experimental annotations for a subset of original 
proteins. The performance of participating methods is then analyzed 
between time points t2 and t3 and presented to the community at 
time t3. It is important to mention that unlike some machine learning 
challenges, CAFA organizers do not provide training data that is 
required to be used. CAFA, thus, evaluates a combination of 
biological knowledge, the ability to collect and curate training data, 
and the ability to develop advanced computational methodology. 
We have previously described some of the principles that guide 
us in organizing CAFA [13]. It is important to mention that CAFA 
is associated with the Automated Function Prediction Special 
Interest Group (Function-SIG) that is regularly held at the 
Intelligent Systems for Molecular Biology (ISMB) conference 
[37]. These meetings provide a forum for exchanging ideas and 
communicating research among the participants. Function-SIG 
also serves as the venue at which CAFA results are initially pre- 
sented and where the feedback from the community is sought. 


3 The Gene Ontology Provides the Functional Repertoire for CAFA 


Computational function prediction methods have been reviewed 
extensively [12, 31] and are also discussed in Chapter 5 [8]. Briefly, a 
function prediction method can be described as a classifier: an algo- 
rithm that is tasked with correctly assigning biological function to a 
given protein. This task, however, is arbitrarily difficult unless the 
function comes from a finite, preferably small, set of functional terms. 
Thus, given an unannotated protein sequence and a set of available 
functional terms, a predictor is tasked with associating terms to a 
protein, giving a score (ideally, a probability) to each association. 

The Gene Ontology (GO) [1] is a natural choice when looking 
for a standardized, controlled vocabulary for functional annota- 
tion. GO’s high adoption rate in the protein annotation commu- 
nity helped ensure CAFA’s attractiveness, as many groups were 
already developing function prediction methods based on GO, or 
could migrate their methods to GO as the ontology of choice. A 
second consideration is GO’s ongoing maintenance: GO is con- 
tinuously maintained by the Gene Ontology Consortium, edited 
and expanded based on ongoing discoveries related to the function 
of biological macromolecules. 

One useful characteristic of the basic GO is that its directed acy- 
clic graph structure can be used to quantify the information provided 
by the annotation; for details on the GO structure see Chaps. 1 and 
3 [14, 15]. Intuitively, this can be explained as follows: the annotation 
term “Nucleic acid binding” is less specific than “DNA binding” 
and, therefore, is less informative (or has a lower information con- 
tent). (A more precise definition of information content and its use in 
GO can be found in [23, 32].) The following question arises: if we 
know that the protein is annotated with the term “Nucleic acid bind- 
ing,” how can we quantify the additional information provided by 
the term “DNA binding” or incorrect information provided by the 
term “RNA binding”? The hierarchical nature of GO is therefore 
important in determining proper metrics for annotation accuracy. 
The way this is done will be discussed in Sect. 4.2. 
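
To make the intuition about information content concrete, the following is a minimal sketch rather than the method used in CAFA. It assumes the conditional definition given later in the caption of Fig. 2, namely that the information content of a term v is -log Pr(v | Pa(v)), estimated here from simple annotation counts. The toy ontology, the six hypothetical proteins, and the choice of a base-2 logarithm are all assumptions made for illustration.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms.
PARENTS = {
    "molecular_function": set(),
    "binding": {"molecular_function"},
    "nucleic acid binding": {"binding"},
    "DNA binding": {"nucleic acid binding"},
    "RNA binding": {"nucleic acid binding"},
}

# Hypothetical annotation database: protein -> consistent (propagated) term set.
_BASE = {"molecular_function", "binding", "nucleic acid binding"}
ANNOTATIONS = {
    "p1": _BASE | {"DNA binding"},
    "p2": _BASE | {"RNA binding"},
    "p3": _BASE | {"RNA binding"},
    "p4": set(_BASE),
    "p5": {"molecular_function", "binding"},
    "p6": {"molecular_function"},
}


def information_content(term, parents=PARENTS, annotations=ANNOTATIONS):
    """Estimate -log2 Pr(term | all parents of term) from annotation counts."""
    # Proteins whose annotation contains every parent of the term.
    with_parents = [t for t in annotations.values() if parents[term] <= t]
    # Among those, proteins that are also annotated with the term itself.
    with_term = [t for t in with_parents if term in t]
    if not with_term:
        return math.inf  # term never observed given its parents
    # log2(a / b) equals -log2(b / a) and avoids returning negative zero.
    return math.log2(len(with_parents) / len(with_term))


def annotation_information(terms):
    """Total information of a consistent sub-graph: sum over its terms."""
    return sum(information_content(t) for t in terms)


general = annotation_information(_BASE)
specific = annotation_information(_BASE | {"DNA binding"})
print(f"nucleic acid binding sub-graph: {general:.3f} bits")
print(f"DNA binding sub-graph:          {specific:.3f} bits")
```

In this toy database, adding “DNA binding” to an annotation that already contains “Nucleic acid binding” contributes two additional bits, which is one way to quantify the extra information carried by the more specific term.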

When annotating a protein with one or more GO terms, the 
association of each GO term with the protein should be described 
using an Evidence Code (EC), indicating how the annotation is 
supported. For example, the Experimental Evidence code (EXP) is 
used in an annotation to indicate that an experimental assay has been 
located in the literature, whose results indicate a gene product's 
function. Other experimental evidence codes include Inferred from 
Expression Pattern (IEP), Inferred from Genetic Interaction (IGI), 
and Inferred from Direct Assay (IDA), among others. Computational 
evidence codes include lines of evidence that were generated by com- 
putational analysis, such as orthology (ISO), genomic context (IGC), 
or identification of key residues (IKR). Evidence codes are not 
intended to be a measure of trust in the annotation, but rather a 
measure of provenance for the annotation itself. However, annota- 
tions with experimental evidence are regarded as more reliable than 
computational ones, having a provenance stemming from experi- 
mental verification. In CAFA, we treat proteins annotated with 
experimental evidence codes as a “gold standard” for the purpose of 
assessing predictions, as explained in the next section. The computa- 
tional evidence codes are treated as predictions. 

From the point of view of a computational challenge, it is impor- 
tant to emphasize that the hierarchical nature of the GO graph leads 
to the property of consistency or True Path Rule in functional annota- 
tion. Consistency means that when annotating a protein with a given 
GO term, it is automatically annotated with all the ancestors of that 
term. For example, a valid prediction cannot include “DNA binding” 
but exclude “Nucleic acid binding” from the ontology because DNA 
binding implies nucleic acid binding. We say that a prediction is not 
consistent if it includes a child term, but excludes its parent. In fact, 
the UniProt resource and other databases do not even list these parent 
terms from a protein’s experimental annotation. If a protein is anno- 
tated with several terms, a valid complete annotation will automati- 
cally include all parent terms of the given terms, propagated to the 
root(s) of the ontology. The result is that a protein's annotation can 
be seen as a consistent sub-graph of GO. Since any computational 
method effectively chooses one of a vast number of possible consistent 
sub-graphs as its prediction, the sheer size of the functional repertoire 
suggests that function prediction is non-trivial. 
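
The consistency requirement described above can be sketched in a few lines. This is an illustrative sketch only: the toy ontology is hypothetical, and it assumes that every edge propagates annotations upward, which is a simplification because some edge types in GO do not obey transitivity.

```python
# Hypothetical toy ontology: term -> set of parent terms.
PARENTS = {
    "molecular_function": set(),
    "binding": {"molecular_function"},
    "nucleic acid binding": {"binding"},
    "DNA binding": {"nucleic acid binding"},
    "RNA binding": {"nucleic acid binding"},
}


def propagate(terms, parents):
    """Return the terms together with all of their ancestors up to the root."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed


def is_consistent(terms, parents):
    """A term set is consistent if every parent of every term is also present."""
    terms = set(terms)
    return all(parents.get(term, set()) <= terms for term in terms)


leaf_only = {"DNA binding"}
full = propagate(leaf_only, PARENTS)
print(is_consistent(leaf_only, PARENTS))  # the child is listed without its parents
print(sorted(full))
print(is_consistent(full, PARENTS))
```

Because databases typically list only the most specific terms, a step like `propagate` is usually applied to both predictions and experimental annotations before they are compared.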


4 Comparing the Performance of Prediction Methods 


In the CAFA challenge, we ask the participants to associate a large 
number of proteins with GO terms and provide a probability score 
for each such association. Having associated a set of GO sub-graphs 
with a given confidence, the next step is to assess how accurate 
these predictions are. This involves: (1) establishing standards of 
truth and (2) establishing a set of assessment metrics. 


4.1 Establishing Standards of Truth


The main challenge to establishing a standard-of-truth set for test- 
ing function prediction methods is to find a large set of correctly 
annotated proteins whose functions were, until recently, unknown. 
An obvious choice would be to ask experimental scientists to pro- 
vide these data from their labs. However, scientists prefer to keep 
the time between discovery and publication as brief as possible, 
which means that there is only a small window in which new exper- 
imental annotations are not widely known and can be used for 
assessment. Furthermore, each experimental group has its own 
“data sequestration window” making it hard to establish a com- 
mon time for all data providers to sequester their data. Finally, to 
establish a good statistical baseline for assessing prediction method 
performance, a large number of prediction targets are needed, 
which is problematic since most laboratories research one or only a 
few proteins each. High-throughput experiments, on the other 
hand, provide a large number of annotations, but those tend to 
concentrate only on few functions, and generally provide annota- 
tions that have a lower information content [32]. 

Given these constraints, we decided that CAFA would not ini- 
tially rely on direct communication between the CAFA organizers 
and experimental scientists to provide new functional data. Instead, 
CAFA relies primarily on established biocuration activities around 
the world: we use annotation databases to conduct CAFA as a 
time-based challenge. To do so, we exploit the following dynamic 
of annotation databases: they grow over time. Many proteins do not 
have experimentally verified annotations at a given time t0, but some 
of these proteins may later gain experimental annotations as 
biocurators add these data into the databases. The subset of proteins 
that were not experimentally annotated at t0, but gained experimental 
annotations by t2, are the ones that we use as a test set during 
assessment (Fig. 1b). In CAFA1 we reviewed the growth of Swiss-Prot 
over time and chose 50,000 target proteins that had no experimental 
annotation in the Molecular Function or Biological Process ontologies 
of GO. At t2, out of those 50,000 targets we identified 866 benchmark proteins, 
i.e., targets that gained experimental annotation in the Molecular 
Function and/or Biological Process ontologies. While a bench- 
mark set of 866 proteins constitutes only 1.7% of the number of 
original targets, it is a large enough set for assessing performance 
of prediction methods. To conclude, exploiting the history of the 
Swiss-Prot database enabled its use as the source for standard-of- 
truth data for CAFA. In CAFA2, we have also considered experi- 
mental annotations from UniProt-GOA [16] and established 3681 
benchmark proteins out of 100,000 targets (3.7%). 
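
The time-based selection of benchmark proteins can be illustrated with a short sketch. It is not the CAFA pipeline: the snapshot format (tuples of protein, GO term, and evidence code), the function names, and the toy proteins are assumptions, the set of experimental codes contains only those named in this chapter, and the per-ontology bookkeeping (Molecular Function versus Biological Process) is omitted for brevity.

```python
# Experimental evidence codes named in this chapter; the full GO list is longer.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IGI", "IEP"}


def experimentally_annotated(snapshot):
    """Proteins with at least one experimentally supported annotation.

    `snapshot` is an iterable of (protein, go_term, evidence_code) tuples.
    """
    return {protein for protein, _term, code in snapshot
            if code in EXPERIMENTAL_CODES}


def select_benchmark(targets, snapshot_t0, snapshot_t2):
    """Targets lacking experimental annotation at t0 that have gained it by t2."""
    annotated_t0 = experimentally_annotated(snapshot_t0)
    annotated_t2 = experimentally_annotated(snapshot_t2)
    return {p for p in targets if p not in annotated_t0 and p in annotated_t2}


# Hypothetical database snapshots for three toy proteins.
targets = {"A", "B", "C"}
snapshot_t0 = [
    ("A", "GO:0003677", "IEA"),  # electronic annotation only
    ("B", "GO:0003723", "IDA"),  # already experimentally annotated at t0
]
snapshot_t2 = snapshot_t0 + [
    ("A", "GO:0003677", "IDA"),  # gained experimental support after t0
    ("C", "GO:0003676", "IEA"),  # still only electronically annotated
]

benchmark = select_benchmark(targets, snapshot_t0, snapshot_t2)
print(sorted(benchmark))
```

Only protein A qualifies in this example: B was already experimentally annotated at t0, and C never gained experimental support.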

One criticism of a time-based challenge is that when assessing 
predictions, we still may not have a full knowledge of a protein's 
function. A protein may have gained experimental validation for 
function f1, but it may also have another function, say f2, associated 
with it, which has not been experimentally validated by the time t2. 
A method predicting f2 may be judged to have made a false-positive 
prediction, even though it is correct (only we do not know it yet). 
This problem, known as the “incomplete knowledge problem” or 
the “open world problem” [10], is discussed in detail in Chapter 8 
[33]. Although the incomplete knowledge problem may impact the 
accuracy of time-based evaluations, its actual impact in CAFA has 
not been substantial. There are several reasons for this, which are 
also discussed elsewhere, including the robustness of the evaluation 
metrics used in CAFA and the fact that the newly added terms may be 
unexpected and more difficult to predict. The influence of incomplete 
data and the conditions under which it can affect a time-based 
challenge were investigated and discussed in [18]. Another criticism 
of CAFA is that the experimental functional annotations are not 
unbiased because some terms have a much higher frequency than others 
due to artificial considerations. There are two chief reasons for this 
bias: first, high-throughput assays typically assign shallow terms to 
proteins, but being high throughput means they can dominate the 
experimentally verified annotations in the databases. Second, 
biomedical research is driven by interest in specific areas of human 
health, resulting in overrepresentation of health-related functions 
[32]. Unfortunately, CAFA1 and CAFA2 could not guarantee unbiased 
evaluation. However, we will expand the challenge in CAFA3 to collect 
genome-wide experimental evidence for several biological terms. Such 
an assessment will result in unbiased evaluation on those specific terms. 


4.2 Assessment Metrics


When assessing the prediction quality of different methods, two 
questions come to mind. First, what makes a good prediction? 
Second, how can one score and rank prediction methods? There is 
no simple answer to either of these questions. As GO comprises 
three ontologies that deal with different aspects of biological func- 
tion, different methods should be ranked separately with respect to 
how well they perform in Molecular Function, Biological Process, 
or the Cellular Component ontologies. Some methods are trained 
to predict only for a subset of any given GO graph. For example, 
they may only provide predictions of DNA-binding proteins or of 
mitochondrial-targeted proteins. Furthermore, some methods are 
trained only on a single species or a subset of species (say, eukary- 
otes), or using specific types of data such as protein structure, and 
it does not make sense to test them on benchmark sets for which 
they were not trained. To address this issue, CAFA scored methods 
not only in general performance, but also on specific subsets of 
proteins taken from humans and model organisms, including Mus 
musculus, Rattus norvegicus, Arabidopsis thaliana, Drosophila 
melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, 
Dictyostelium discoideum, and Escherichia coli. In CAFA2, we 
extended this evaluation to also assess the methods only on bench- 
mark proteins on which they made predictions; i.e., the methods 
were not penalized for omitting any benchmark protein. 

One way to view function prediction is as an information 
retrieval problem, where the most relevant functional terms should 
be correctly retrieved from GO and properly assigned to the amino 
acid sequence at hand. Since each term in the ontology implies 
some or all of its ancestors,¹ a function prediction program’s task is 
to assign the best consistent sub-graph of the ontology to each 
new protein and output a prediction score for this sub-graph and/or 
each predicted term. An intuitive scoring mechanism for this 
type of problem is to treat each term independently and provide 
the precision-recall curve. We chose this evaluation as our main 
evaluation in CAFA1 and CAFA2. 

Let us provide more detail. Consider a single protein on which 
evaluation is carried out, but keep in mind that CAFA eventually 
averages all metrics over the set of benchmark proteins. Let T 
be a set of experimentally determined nodes and P a non-empty set 
of predicted nodes in the ontology for the given protein. Precision 
(pr) and recall (rc) are defined as 

$$
pr(P,T) = \frac{|P \cap T|}{|P|}, \qquad rc(P,T) = \frac{|P \cap T|}{|T|},
$$

where |P| is the number of predicted terms, |T| is the number of 
experimentally determined terms, and |P ∩ T| is the number of terms 
appearing in both P and T; see Fig. 2 for an illustrative example of this 
measure. Usually, however, methods will associate scores with each 
predicted term and then a set of terms P will be established by 
defining a score threshold τ; i.e., all predicted terms with scores greater 
than τ will constitute the set P. By varying the decision threshold 
τ ∈ [0, 1], the precision and recall of each method can be plotted as a 
curve (pr(τ), rc(τ)), where one axis is the precision and the other the 
recall; see Fig. 3 for an illustration of pr-rc curves and [30] for pr-rc 
curves in CAFA1. To compile the precision-recall information into a 
single number that would allow easy comparison between methods, 
we used the maximum harmonic mean of precision and recall 
anywhere on the curve, or the maximum F1-measure, which we call Fmax: 

$$
F_{\max} = \max_{\tau} \left\{ \frac{2 \cdot pr(\tau) \cdot rc(\tau)}{pr(\tau) + rc(\tau)} \right\},
$$

where we modified pr(τ) and rc(τ) to reflect the dependency on τ. 
It is worth pointing out that the F-measure used in CAFA places 
equal emphasis on precision and recall as it is unclear which of the 
two should be weighted more. One alternative to F1 would be the 
use of a combined measure that weighs precision over recall, which 
reflects the preference of many biologists for few answers with a 
high fraction of correctly predicted terms (high precision) over 
many answers with a lower fraction of correct predictions (high 
recall); the rationale for this tradeoff is illustrated in Fig. 3. 


Some types of edges in Gene Ontology violate the transitivity property (con- 
sistency assumption), but they are not frequent. 
[Figure 2 image: panel (a) Predicted function; panel (b) True function. Node labels in the hypothetical ontology include “Biological process,” “Cellular process,” and “Cell differentiation.”] 


Fig. 2 CAFA assessment metrics. (a) Red nodes are the predicted terms P for a particular decision threshold in 
a hypothetical ontology and (b) blue nodes are the true, experimentally determined terms T. The circled terms 
represent the overlap between the predicted sub-graph and the true sub-graph. There are two nodes (circled) 
in the intersection of P and T, where |P| = 5 and |T| = 3. This sets the prediction's precision at 2/5 = 0.4 and 
recall at 2/3 = 0.667, with F1 = 2 × 0.4 × 0.667 / (0.4 + 0.667) = 0.5. The remaining uncertainty (ru) is the 
information content of the uncircled blue node in panel (b), while the misinformation (mi) is the total 
information content of the uncircled red nodes in panel (a). The information content of any node v is calculated from a 
representative database as −log Pr(v | Pa(v)), where Pr(v | Pa(v)) is the probability that the node is present in a 
protein’s annotation given that all its parents are also present in its annotation 


However, preferring precision over recall in a hierarchical setting 
can steer methods to focus on shallow (less informative) terms in 
the ontology and thus be of limited use. At the same time, putting 
more emphasis on recall may lead to overprediction, a situation in 
which many or most of the predicted terms are incorrect. For this 
reason, we decided to equally weight precision and recall. 
Additional metrics within the precision-recall framework have 
been considered, though not implemented yet. 
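
To make the weighting tradeoff concrete, the sketch below uses the standard F-beta measure, in which beta < 1 favors precision and beta > 1 favors recall. This is offered only as an illustration of what a weighted alternative could look like; it is not a metric used in CAFA, and the function name is an assumption of this example.

```python
def f_beta(pr, rc, beta=1.0):
    """Weighted harmonic mean of precision and recall (standard F-beta).

    beta = 1 gives the F1-measure; beta < 1 weighs precision more
    heavily, beta > 1 weighs recall more heavily.
    """
    if pr == 0 and rc == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * pr * rc / (b2 * pr + rc)


# With beta = 0.5, a precise but incomplete prediction scores higher
# than a complete but imprecise one.
precise = f_beta(1.0, 0.5, beta=0.5)
complete = f_beta(0.5, 1.0, beta=0.5)
```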

Precision and recall are useful because they are easy to interpret: 
a precision of 1/2 means that one half of all predicted terms are cor- 
rect, whereas a recall of 1/3 means that one third of the experimen- 
tal terms have been recovered by the predictor. Unfortunately, 
precision-recall curves and Fmax, while simple and interpretable measures 
for evaluating ontology-based predictions, are limited because 
they ignore the hierarchical nature of the ontology and dependencies 
among terms. They also do not directly capture the information 
content of the predicted terms. Assessment metrics that take into 
account the information content of the terms were developed in the 
past [22, 23, 29], and are also detailed in Chapter 12 [28]. In 
CAFA2 we used an information-theoretic measure in which each 
term is assigned a probability that is dependent on the probabilities 
of its direct parents. These probabilities are calculated from the fre- 
quencies of the terms in the database used to generate the CAFA 
targets. The entire ontology graph, thus, can be seen as a simple 
Bayesian network [5]. Using this representation, two information- 
theoretic analogs of precision and recall can be constructed. We refer 
to these quantities as misinformation (mi), the information content 
attributed to the nodes in the predicted graph that are incorrect, and 

Fig. 3 Precision-recall curves and remaining uncertainty-misinformation curves. This figure illustrates the 
need for multiple assessment metrics, and understanding the context in which the metrics are used. (a) Two 
pr-rc curves corresponding to two prediction methods M1 and M2. The point on each curve that gives Fmax is 
marked as a circle. Although the two methods have a similar performance according to Fmax, method M1 
achieves its best performance at high recall values, whereas method M2 achieves its best performance at high 
precision values. (b) Two ru-mi curves corresponding to the same two prediction methods with marked points 
where the minimum semantic distance is achieved. Although the two methods have similar performance in the 
pr-rc space, method M1 outperforms M2 in ru-mi space. Note, however, that the performance in ru-mi space 
depends on the frequencies of occurrence of every term in the database. Thus, two methods may score 
differently in their Smin when the reference database changes over time, or when a different database is used 


remaining uncertainty (ru), the information content of all nodes 
that belong to the true annotation but not the predicted annotation. 
More formally, if T is a set of experimentally determined nodes and 
P a set of predicted nodes in the ontology, then 

$$
ru(P,T) = -\sum_{v \in T - P} \log \Pr(v \mid Pa(v)); \qquad mi(P,T) = -\sum_{v \in P - T} \log \Pr(v \mid Pa(v)),
$$

where Pa(v) is the set of parent terms of the node v in the ontology 
(Fig. 2). A single performance measure to rank methods, the minimum 
semantic distance Smin, is the minimum distance from the 
origin to the curve (ru(τ), mi(τ)). It is defined as 

$$
S_{\min} = \min_{\tau} \left[ ru^{k}(\tau) + mi^{k}(\tau) \right]^{1/k},
$$

where k ≥ 1. We typically choose k = 2, in which case Smin is the minimum 
Euclidean distance between the ru-mi curve and the origin of 
the coordinate system (Fig. 3b). The ru-mi plots and Smin metrics 
compare the true and predicted annotation graphs by adding an 
additional weighting component to high-information nodes. In that 
manner, predictions with a higher information content will be 
assigned larger weights. The semantic distance has been a useful 
measure in CAFA2 as it properly accounts for term dependencies in 
the ontology. However, this approach also has limitations in that it 
relies on an assumed Bayesian network as a generative model of protein 
function as well as on the available databases of protein func- 
tional annotations where term frequencies change over time. While 
the latter limitation can be remedied by more robust estimation of 
term frequencies in a large set of organisms, the performance accura- 
cies in this setting are generally less comparable over two different 
CAFA experiments than in the precision-recall setting. 
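
The information-theoretic quantities can be sketched in the same way. The example below is an illustration for a single protein, assuming the conditional probabilities Pr(v | Pa(v)) have already been estimated from a reference database. The function names, the toy probabilities, and the choice of base-2 logarithms are assumptions of this sketch, not details taken from the CAFA implementation.

```python
import math


def information_content(cond_prob):
    """Map each term v to -log2 Pr(v | Pa(v)); the base 2 is an assumption."""
    return {term: -math.log2(p) for term, p in cond_prob.items()}


def ru_mi(predicted, true, ic):
    """Remaining uncertainty and misinformation for one protein."""
    ru = sum(ic[v] for v in true - predicted)   # true terms that were missed
    mi = sum(ic[v] for v in predicted - true)   # predicted terms that are wrong
    return ru, mi


def s_min(scores, true, ic, thresholds, k=2):
    """Minimum distance from the origin to the (ru, mi) curve."""
    best = math.inf
    for tau in thresholds:
        predicted = {term for term, s in scores.items() if s > tau}
        ru, mi = ru_mi(predicted, true, ic)
        best = min(best, (ru ** k + mi ** k) ** (1 / k))
    return best


# Hypothetical conditional probabilities for four placeholder terms.
toy_ic = information_content({"a": 0.5, "b": 0.25, "c": 0.5, "f": 0.125})
toy_true = {"a", "b", "f"}
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.4}
```

In this toy case the missed term "f" is rare (information content of 3 bits), so it dominates the remaining uncertainty at every threshold, which illustrates how high-information nodes receive larger weights.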


Critical assessment challenges have been successfully adopted in a 
number of fields due to several factors. The first is the recognition that 
improvements to methods are indeed necessary. The second is the ability 
of the community to mobilize enough of its members to engage in 
a challenge. Mobilizing a community is not a trivial task, as groups 
have their own research priorities and only a limited amount of 
resources to achieve them, which may deter them from undertaking 
the time-consuming and competitive effort that a challenge may pose. 
At the same time, there are quite a few incentives to join a com- 
munity challenge. Testing one's method objectively by a third 
party can establish credibility, help point out flaws, and suggest 
improvements. Engaging with other groups may lead to collabora- 
tions and other opportunities. Finally, the promise of doing well in 
a challenge can be a strong incentive heralding a group's excellence 
in their field. Since the assessment metrics are crucial to the perfor- 
mance of the teams, large efforts are made to create multiple met- 
rics and to describe exactly what they measure. Good challenge 
organizers try to be attentive to the requests of the participants, 
and to have the rules of the challenge evolve based on the needs of 
the community. A final factor is an understanding that a challenge's 
ultimate goal is to improve methodologies and that it takes several 
rounds of repeating the challenge to see results. 

The first two CAFA challenges helped clarify that protein func- 
tion prediction is a vibrant field, but also one of the most challeng- 
ing tasks in computational biology. For example, CAFA provided 
evidence that the available function prediction algorithms 
outperform a straightforward use of sequence alignments in func- 
tion transfer. The performance of methods in the Molecular 
Function category has consistently been reliable and also showed 
progress over time (unpublished results from CAFA2). On the 
other hand, the performance in the Biological Process or Cellular 
Component ontologies has not yet met expectations. One of the 
reasons for this may be that the terms in these ontologies are less 
predictable using amino acid sequence data and instead would rely 
more on high-quality systems data; e.g., see [6]. The challenge has 
also helped clarify the problems of evaluation, both in terms of eval- 
uating over consistent sub-graphs in the ontology but also in the 
presence of incomplete and biased molecular data. Finally, although 


144 Iddo Friedberg and Predrag Radivojac 


it is still early, some best practices in the field are beginning to 
emerge. Exploiting multiple types of data is typically advantageous, 
although we have observed that both machine learning expertise 
and good biological insights tend to result in strong performance. 
Overall, while the methods in the Molecular Function ontology 
seem to be maturing, in part because of the strong signal in sequence 
data, the methods in the Biological Process and Cellular Component 
ontologies still appear to be in the early stages of development. 
With the help of better data over time, we expect significant 
improvements in these categories in the future CAFA experiments. 

Overall, CAFA generated a strong positive response to the call 
for both challenge rounds, with the number of participants sub- 
stantially growing between CAFA1 (102 participants) and CAFA2 
(147). This indicates that there exists significant interest in devel- 
oping computational protein function prediction methods, in 
understanding how well they perform, and in improving their per- 
formance. In CAFA2 we preserved the experiment rules, ontolo- 
gies, and metrics we used in CAFA1, but also added new ones to 
better capture the capabilities of different methods. The CAFA3 
experiment will further improve evaluation by facilitating unbiased 
evaluation for several select functional terms. 

More rounds of CAFA are needed to know if computational 
methods will improve as a direct result of this challenge. But given the 
community's growth and growing interest, we believe that CAFA is a 
welcome addition to the community of protein function annotators. 
Get GO! Retrieving GO Data Using AmiGO, 
QuickGO, API, Files, and Tools 


Monica Munoz-Torres and Seth Carbon 


Abstract 


The Gene Ontology Consortium (GOC) produces a wealth of resources widely used throughout the 
scientific community. In this chapter, we discuss the different ways in which researchers can access the 
resources of the GOC. We here share details about the mechanics of obtaining GO annotations, both by 
manually browsing, querying, and downloading data from the GO website, as well as computationally 
accessing the resources from the command line, including the ability to restrict the data being retrieved to 
subsets with only certain attributes. 


Key words Gene ontology, Ontology, Annotation resources, Annotation, Genomics, Transcriptomics, 
Bioinformatics, Biocuration, Curation, Access, AmiGO, QuickGO 


1 Introduction 


The efforts of the Gene Ontology Consortium (GOC) are focused 
on three major subjects: (1) the development and maintenance of 
the ontologies; (2) the annotation of gene products, which includes 
making associations between the ontologies and the genes and 
gene products in all collaborating databases; and (3) the develop- 
ment of tools that facilitate the creation, maintenance, and use of 
the ontologies. This chapter is focused on the mechanics of obtain- 
ing GO annotations, both directly and computationally, including 
the ability to restrict the data being retrieved to subsets with only 
certain attributes. 

GO data is the culmination of various forms of curation, made 
accessible through a variety of interfaces and downloadable in dif- 
ferent forms, depending on your intended use. Because the data 
and software landscape are constantly changing, it is hard to cover 
with any permanence the best way to access the data; this inherent 
limitation should be kept in mind as we navigate through this section. 
This chapter is intended as an overview of the different ways users 
can access GO data (via web portals, downloadable files, and API), 
a quick description of basic software used by GO, and as a reference 
for where to find more detailed and up-to-date information about 
these subjects. 


2 Web Interfaces to Access the GO 


This section covers the online interfaces for accessing and interacting 
with the data using standard web browsers. Most consumers of the 
GO can make use of data browsers such as AmiGO, QuickGO, and 
data browsers embedded within more specific databases. 


2.1 AmiGO 


AmiGO ([1]; http://amigo.geneontology.org; Fig. 1a) is the official 
web-based open-source tool for querying, browsing, and visualizing 
the Gene Ontology and annotations collected from the MODs 
(model organism databases), UniProtKB, and other sources (a complete 
list of member institutions currently contributing to the GOC is 
at http://geneontology.org/page/go-consortium-contributors-list). 
Notable features include basic searching, browsing, the ability 
to download custom data sets, and a common question “wizard” 
interface. Recent changes have brought improvements both in 
speed and the variety of search modes, as well as the availability of 
additional data types, such as the display of annotation extensions 
(see Chap. 17 [2]) and display of protein forms (splice variants and 
proteins with post-translational modifications). More details about 
the latest improvements on the AmiGO browser can also be found 
in [3]. 


2.2 QuickGO 


The Gene Ontology Annotation (GOA) project at the European 
Molecular Biology Laboratory's European Bioinformatics Institute 
(EMBL-EBI) also makes available the QuickGO browser ([4]; 
http://www.ebi.ac.uk/QuickGO; Fig. 1b), a web-based tool that 
allows easy browsing of the Gene Ontology (GO) and all associ- 
ated electronic and manual GO annotations provided by the GO 
Consortium annotation groups. Included in its many features are 
extensive search and filter capabilities for GO annotations, a pow- 
erful integrated subset/slim interface, as well as an integrated historical 
view of the terms. For data consumption, QuickGO provides 
broad-ranging web services and cart functionality (a way of persisting 
abstract elements, like term IDs, between parts of the QuickGO 
web application). 

AmiGO and QuickGO make use of the same GO data sets, 
with somewhat different implementations according to the require- 
ments of funding sources and respective users. AmiGO, in its 
entirety, is a product of the GO Consortium and is the official 
channel for dissemination of the GO data sets, adhering to funding 
recommendations from NHGRI-NIH. QuickGO is produced, 
managed, and funded by EMBL-EBI; the members of QuickGO's 
managing team are also members of the GOC. 


[Figure 1 image: screenshots of the AmiGO and QuickGO landing pages. Legible labels include the Grebe search wizard, quick search with auto-complete, term enrichment, GOOSE (SQL queries against a legacy GO database), search and filtering of GO annotation sets, GO slims, term history, and the QuickGO tutorial and tips.] 

Fig. 1 Landing pages for the AmiGO (a) and QuickGO (b) browsers. A few features are highlighted for each browser 
2.3 Other Browsers 


The ontology component of the GO is also searchable and browsable 
from various third party generic ontology browsers such as OntoBee 
(http://ontobee.org), the EMBL-EBI Ontology Lookup Service 
(OLS) (http://www.ebi.ac.uk/ontology-lookup), OLSVis 
(http://ols.wordvis.com), and BioPortal (http://bioportal.bioontology.org). 
Each of these systems has its own particular strengths; for 
example, OntoBee is aimed at the semantic web community and 
provides the ontology as part of a linked data platform [5], whereas 
OLSVis is geared towards visualization. However, none of these 
browsers currently provide access to the annotations. 


2.4 Term Enrichment Tool 


One of the main uses of the GO is to perform enrichment analysis 
on gene sets. For example, given a set of genes that are upregulated 
under certain conditions, an enrichment analysis will find which GO 
terms are overrepresented (or underrepresented) using the available 
annotations for that gene set. The GO website offers a service that 
directly connects users with the enrichment analysis tool from the 
PANTHER Classification System [6]. The PANTHER database is 
up-to-date with GO annotations, and their enrichment tool is 
driven by GO data. Further details about this enrichment tool, as 
well as a list of supported gene IDs, are available from the PANTHER 
website at http://www.pantherdb.org/ and at 
http://www.pantherdb.org/tips/tips_batchIdSearch_supportedId.jsp. 
More information on enrichment analysis using the GO is available in Chap. 13 
[7] on “Gene-Category Analysis.” 


2.5 A Simple Example of Data Exploration Using AmiGO (See Fig. 2) 


To give a concrete example of the type of easy GO data exploration 
that can be accomplished using a web interface, we here provide an 
example where a user on the AmiGO annotation search interface 
(http://amigo.geneontology.org/amigo/search/annotation) is 
trying to find associations between genes/gene products and epi- 
thelial processes, while searching only data outside those available 
for human, and which have experimental evidence. 
The user could: 


— Type “epithe?” into the text filter box ( Free-text filtering) to the 
left of the results area. 


— Open the “Taxon” facet and select the [-] next to “Homo 
sapiens.” 

— Open the “Evidence type” facet and select the [+] next to “exper- 
imental evidence.” 


The remaining results would fit the initial search criteria. 
However, suppose that the user wants to further refine their search 
to strictly look at all GO annotations that are directly or indirectly 
annotated to the GO term “epithelial cell differentiation” 
(GO:0009913). Following the steps above, they could: 


[Figure 2 image: screenshot of the AmiGO annotation search interface. The “User filters” panel lists the filters applied in this example: taxon label “Homo sapiens” (excluded), evidence type closure “experimental evidence,” and regulates closure label “epithelial cell differentiation.”] 

Fig. 2 Data exploration using the AmiGO annotation search interface. All results from this example are listed in 
panel (a). (b) Shows a detail about the filters applied throughout the search, listed under “User filters.” An 
example of the details that appear for each gene or gene product is visible in (c): note that the information 
about the GO term ID for “epithelial cell differentiation” (GO:0009913) appears when users hover over the 
“Direct annotation” details 


— Open the “Inferred annotation” facet and select the [+] next 
to “epithelial cell differentiation,” then 


— Remove the text filter by clicking the [x] next to the text entry. 


This would leave the user with all GO annotations directly or 
indirectly annotated with “epithelial cell differentiation” 
(GO:0009913) that are not from human data and have some 
kind of experimental evidence associated with them. 
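
The logic of the three facet selections above can be mimicked on a local list of annotation records. The sketch below is only an illustration of the filtering logic, not of how AmiGO is implemented: the record layout, the field names, the gene names, and the set of evidence codes treated as experimental are all assumptions of this example.

```python
# Evidence codes treated here as "experimental evidence" (an assumption
# about what the facet covers; these are the classic experimental codes).
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def matches(annotation,
            term_label="epithelial cell differentiation",
            excluded_taxon="Homo sapiens"):
    """Apply the three filters from the walkthrough to one record."""
    return (
        annotation["taxon"] != excluded_taxon              # [-] on the taxon
        and annotation["evidence"] in EXPERIMENTAL_CODES   # [+] on evidence type
        and term_label in annotation["closure_labels"]     # direct or inferred
    )


# Hypothetical records; closure_labels holds the annotated term plus all
# terms it is inferred to belong to.
annotations = [
    {"gene": "geneA", "taxon": "Homo sapiens", "evidence": "IDA",
     "closure_labels": {"epithelial cell differentiation"}},
    {"gene": "geneB", "taxon": "Mus musculus", "evidence": "IMP",
     "closure_labels": {"epidermal cell differentiation",
                        "epithelial cell differentiation"}},
    {"gene": "geneC", "taxon": "Mus musculus", "evidence": "IEA",
     "closure_labels": {"epithelial cell differentiation"}},
]

hits = [a["gene"] for a in annotations if matches(a)]
```

Only the second record survives: the first is excluded by taxon and the third by its electronic evidence code.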


3 GO Files: Description and Availability 


GO data files contain the current and long-term output of ontology 
and annotation efforts that are used for exchanging data across 
various systems. There are several use cases where it may be easier 
to mine the data directly from the files using a variety of tools. 
The most commonly used raw data files can be broken down into 
two categories: ontology and association files. 
3.1 Ontology 


In the context of GO, ontologies are graph structures comprised of 
classes for molecular functions, the biological processes they contrib- 
ute to, the cellular locations where they occur, and the relationships 
connecting them all, in a species-independent manner [3]. Each term 
in the GO has defined relationships to one or more other terms in the 
same domain, and sometimes to other domains. Additional informa- 
tion about ontologies in general is also available from Chap. 1 [8]. 

GO ontology data are available from the GO website at 
http://geneontology.org/page/download-ontology. There are three 
different editions of the GO, in increasing order of complexity: 
go-basic, go, and go-plus. 


go-basic: This basic edition of the GO is filtered such that annota- 
tions can be propagated up the graph. The relations included are 
is_a, part_of, regulates, negatively_regulates, and positively_regu- 
lates. It is important to note that this version excludes relationships 
that cross the three main GO hierarchies. Many legacy tools that 
use the GO make these assumptions about the GO, so we make 
this version available in order to support these tools. This version 
of the GO ontology is available in OBO format only. 


go: This core edition of the GO includes additional relationship 
types, including some that span the three GO hierarchies, such as 
has_part and occurs_in, connecting the otherwise disjoint hierarchies 
found in go-basic. This version of the GO ontology is available 
in two formats, OBO and OWL-RDF/XML. 


go-plus: This is the most expressive edition of the GO; it includes 
more relationships than go and connections to external ontologies, 
including the Chemical Entities of Biological Interest ontology 
(ChEBI; [9]), the Uberon anatomy (or stage) ontology [10], and 
the Plant Ontology for plant structure/stage (PO; [11]). It also 
includes import modules that are minimal subsets of those ontologies. 
This allows for cross-ontology queries, such as “find all genes 
that perform functions related to the brain” (e.g., in AmiGO: 
http://amigo.geneontology.org/amigo/term/UBERON:0000955#display-associations-tab). 
go-plus [12] also includes rules encoding 
biological constraints, such as the spatial exclusivity between a 
nucleus and a cytosol. These constraints are used for validation of 
the ontology and annotations [13]. This version of the GO ontol- 
ogy is available in OWL-RDF/XML. 

When working with the ontologies, the official language of the 
Gene Ontology is the Web Ontology Language, or OWL, which is a 
standard defined by the World Wide Web Consortium (W3C). The 
GO has approximately 41,000 terms covering over 4 million genes in 
almost 470,000 species [3]. Its organization goes beyond a simple 
terminology structured as a directed acyclic graph (DAG): it not only 
consists of over 41,000 classes but also includes an import chain that 
brings in an additional 10,000 classes from other ontologies 
([10]; see “go-plus” above). In order to best represent the 

Fig. 3 Visualizing a GO term using Protégé. Protégé displays the details of the term “adenine import across 
plasma membrane” (GO:0098702). The underlying structure of the term is written in the OWL language, which 
adds flexibility to the expression of associations between genes and gene products and the terms in the 
ontology, compared to the possibilities offered in OBO. For example, in this term, inter-ontology logical definitions 
(OWL axioms) coming from the ChEBI ontology [9] are visible; this is not possible to see when visualizing the 
ontology using OBO 


complexity of these classes, along with the approximately 27 million 
associations that connect them to each molecular entity (genes or 
gene products), members of the GOC software development team 
worked on building an axiomatic structure for GO. That is, they 
assigned logical definitions (known as OWL axioms or self-evidently 
true statements) to all the classes; the Gene Ontology has been effec- 
tively axiomatized, that is, reduced to this system of axioms in OWL, 
and is highly dependent on the OWL tool stack [10]. Examples of 
OWL stanzas for terms that are defined by a logical definition in the 
Gene Ontology are available from GOC—Munoz-Torres (CA), 
2015 [3]. 

A number of tools, frameworks, and software libraries support 
OWL, including the ontology editor Protégé (http://protege. 
stanford.edu/; Fig. 3), the Java OWL API, and the OWLTools 
framework produced by the GO (https://github.com/owlcollab/ 
owltools). Figure 3 shows a GO term visualized using Protégé; its 
underlying structure is the OWL language. We also make the 
ontology editions available in OBO Format, which is a simpler for- 
mat used in many bioinformatics applications (note that “go-plus” 
is not available in OBO format). The two formats can be intercon- 
verted using the Robot tool produced by the GO Consortium, 
which can be found at https://github.com/ontodev/robot/. 
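
For readers who want to work with the OBO files directly, the sketch below reads [Term] stanzas from OBO-formatted text and collects the ancestors of a term along is_a and part_of links, in the spirit of propagating annotations up the go-basic graph. It is a deliberately minimal reader written for this illustration, not a replacement for OWLTools or a full OBO parser: it ignores obsolete terms, the regulates relations, and most other tags, and the sample identifiers with the EX prefix are made up.

```python
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: EX:0000001
name: root process

[Term]
id: EX:0000002
name: child process
is_a: EX:0000001 ! root process

[Term]
id: EX:0000003
name: grandchild process
is_a: EX:0000002 ! child process
relationship: part_of EX:0000001 ! root process
"""

# Relationship types followed upward in this sketch (an assumption).
PROPAGATING = {"part_of"}


def parse_obo_terms(text):
    """Return {term_id: {"id", "name", "parents"}} for each [Term] stanza."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are kept.
            current = {"parents": set()} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue
        tag, _, value = line.partition(":")
        value = value.strip()
        if tag == "id":
            current["id"] = value
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["parents"].add(value.split("!")[0].strip())
        elif tag == "relationship":
            relation, target = value.split("!")[0].split()[:2]
            if relation in PROPAGATING:
                current["parents"].add(target)
    return terms


def ancestors(term_id, terms):
    """All terms reachable from term_id by following parent links."""
    seen = set()
    stack = [term_id]
    while stack:
        for parent in terms.get(stack.pop(), {}).get("parents", ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


terms = parse_obo_terms(SAMPLE_OBO)
```

An annotation to the grandchild term in this toy graph would propagate to both the child and the root term.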
The GO project is constantly evolving, and it welcomes feedback 
from all users (see below in Subheading 5.3). Research groups 
may contribute to the GO by either providing suggestions for updating 
the ontology (e.g., requests for new ontology terms) or by providing 
annotations. Requests for new synonyms or clarification of 
textual definitions are also welcomed. 

Annotators and other data creators can search whether a term 
currently exists using the AmiGO browser at http://amigo.geneontology.org/, 
or may request new ones using either the GO 
issue tracker on GitHub or TermGenie. TermGenie ([14]; http://termgenie.org) 
is a web-based tool for requesting new Gene 
Ontology classes. It also allows for an ontology developer to review 
all generated terms before they are committed to the ontology. 
The system makes extensive use of OWL axioms, but can be easily 
used without understanding these axioms. Users who are not yet familiar 
with TermGenie, or who do not yet have permission to use it 
directly, may submit ontology updates and requests using the GO 
curator request tracker on GitHub 
(https://github.com/geneontology/go-ontology/issues), which allows free-text form 
submissions. For more information on how to best contribute to the GO, 
please see Chap. 7 [15]. 


3.2 Ontology Subsets 


Gene Ontology subsets (also sometimes known as “slims”) are cut-down 
versions of the ontologies, containing a reduced number of 
terms (e.g., species-specific subsets or more generic subsets with 
“useful” terms in various categories). They give a broad overview 
of the ontology content without the detail of the specific fine-grained 
terms. Subsets are particularly useful for giving a summary 
of the results of GO annotation of a genome, microarray, or cDNA 
collection when broad classification of gene product function is 
required. Further information, including Java-based tools and data 
downloads, is available from the GO website 
(http://geneontology.org/page/go-slim-and-subset-guide). 


3.3 Association Files 


The annotation process captures the activities and localization of a 
gene product using GO terms, providing a reference, and indicat- 
ing the kind of available evidence in support of the assignment of 
each term using evidence codes. Currently, the main format for 
annotation information in the GO is the Gene Association 
File (GAF, http://geneontology.org/page/go-annotation-file-formats). 
This is the standardized file format that members of the 
Consortium use for submitting data. The annotation data is stored 
in tab-delimited plain text files, where each line in the file represents 
a single association between a gene product and a GO term, 
with an evidence code, the reference to support the link between 
them, and other information. The GAF file format has several 
different “flavors,” with 2.1 being the most current version. Additional 
details about GAF files are found in Chap. 3 [16]. 
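
Because each GAF line is a tab-delimited record, a few lines of Python are enough to read one. The sketch below assumes the 17-column layout of GAF 2.1 and treats lines starting with “!” as header comments; the dictionary keys are names chosen for this example rather than official identifiers, and the sample record (database, gene, and reference) is invented.

```python
import csv
import io

# Column order of GAF 2.1; the key names are this sketch's own labels.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by", "annotation_extension",
    "gene_product_form_id",
]


def read_gaf(handle):
    """Yield one dict per annotation line, skipping '!' comment lines."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("!"):
            continue
        # Pad short rows so that optional trailing columns are present.
        row += [""] * (len(GAF_COLUMNS) - len(row))
        yield dict(zip(GAF_COLUMNS, row))


# A made-up record for illustration only.
SAMPLE_GAF = "!gaf-version: 2.1\n" + "\t".join([
    "ExampleDB", "EX0001", "exg1", "", "GO:0009913", "PMID:0000000",
    "IDA", "", "P", "example gene 1", "", "protein",
    "taxon:10090", "20150101", "ExampleDB", "", "",
]) + "\n"

rows = list(read_gaf(io.StringIO(SAMPLE_GAF)))
experimental = [r for r in rows if r["evidence_code"] == "IDA"]
```

The same function can be pointed at an open file handle for a downloaded, uncompressed association file, after which rows can be filtered by evidence code, aspect, or taxon as needed.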
Recently, the GPAD/GPI files were developed, which are 
essentially a normalized version of GAF information. These formats 
are expected to have more prominence in the future, and 
further details about them can be found on the GO website 
(http://geneontology.org/page/go-annotation-file-formats). 

Because they are tab-delimited text files, both the GAF and 
GPAD/GPI file formats are very amenable to mining with command 
line tools. OWLTools can also be used to access this annotation 
information with operations such as connecting the annotations 
to ontology information for exploration and reasoning, OWL translation, 
validation, taxon checks, and link prediction. More advanced 
details on this topic are explained on the OWLTools project 
wiki (https://github.com/owlcollab/owltools/wiki). 

Details on how to make and evaluate GO annotations are dis- 
cussed in Chap. 4 [17] on “Best Practices in Manual Annotation 
with the Gene Ontology,” and in Chap. 8 [18] on “Evaluating 
Computational Gene Ontology Annotations.” Information is also 
available in the GO Annotation Guide 
(http://geneontology.org/page/go-annotation-policies); more information on the meaning 
and use of the evidence codes in support of each annotation can be 
found in the GO Evidence Codes documentation 
(http://geneontology.org/page/guide-go-evidence-codes). The GOC is 
currently transitioning from using evidence codes into implement- 
ing the Evidence Ontology (ECO) to describe the evidence in sup- 
port of each association between a gene product and a GO term. 
A detailed description of the Evidence Ontology and its use cases 
is included in Chap. 18 [19] on “The Evidence and Conclusion 
Ontology: Supporting Conclusions & Assertions with Evidence.” 


4 Making Your Own Tools 


In addition to using off-the-shelf tools provided by the GOC or 
other users, we also provide libraries and APIs to enable end-users 
to easily create their own tools for working with and analyzing 
GO data. 

Within the Java/JVM ecosystem, the OWLTools
(https://github.com/owlcollab/owltools) and OWL API
(https://github.com/owlcs/owlapi) libraries are the primary tools to work
with the data. Since OWL is the internal representation format
used by the GOC, standard OWL reasoners and tools are all usable
with the data. For slightly less general access to the data, the
OWLTools(-Core) wrapper library adds numerous helper methods
to access OBO-specific fields (e.g., synonyms, alternative IDs), walk
graphs, create closures, and perform other common operations.

On the JavaScript side (both client and server), AmiGO development
has produced JavaScript APIs
(http://wiki.geneontology.org/index.php/AmiGO_2_Manual:_JavaScript)
and widgets
(http://wiki.geneontology.org/index.php/AmiGO_2_Manual:_Widgets)
for better access and integration with other tools. Users interested in
using the JavaScript API or widgets from AmiGO in their own site
should become familiar with the manager and response interfaces,
which are the core of the JavaScript interface. An introductory overview
of the JavaScript API and widgets, as well as details on implementation
engines, the response class, and the configuration class, can
also be found in the JavaScript section of the AmiGO Manual,
listed above.

In addition, AmiGO provides methods for producing incoming
searches to allow external sites to link to relevant information.
Documentation about these methods can be found at
http://wiki.geneontology.org/index.php/AmiGO_2_Manual:_Linking.


5 Additional Information 


5.1 Mappings


The GO project provides mappings between GO terms and
other key related systems (built for other purposes), such as
Enzyme Commission numbers or Kyoto Encyclopedia of Genes
and Genomes (KEGG). However, one should be aware that
these mappings are neither complete nor exact and should be
used with caution. A complete listing of mappings available for
the resources of the GOC can be found at
http://geneontology.org/page/download-mappings. Additional information
about alternative and complementary resources to the GO is available
in Chap. 19 [20].


5.2 Legacy Interface for GO


Currently, the AmiGO and QuickGO interfaces have moved away
from SQL database derivatives of the data sets. However, to support
legacy applications and queries, the GO data is regularly converted
into an SQL database (MySQL). These builds can be
downloaded and installed on a local machine, or queried remotely
using the GO Online SQL/Solr environment (GOOSE;
http://amigo.geneontology.org/goose). More information about SQL
access, including various downloads and schema information, can
be found in the legacy SQL section of the GO website
(http://geneontology.org/page/lead-database-guide).


5.3 Help/Troubleshooting Software and Data


In addition to other functions, the GO Helpdesk addresses user
queries about the Gene Ontology and related resources. The GO
Helpdesk will direct any questions or concerns with GO data,
software, or analysis to the appropriate people within the consortium.
You can directly contact the GO Helpdesk using the site
form (http://geneontology.org/form/contact-go), which will
automatically enter your query into an internal tracker to ensure
responsiveness.


Chapter 12


Semantic Similarity in the Gene Ontology 


Catia Pesquita 


Abstract 


Gene Ontology-based semantic similarity (SS) allows the comparison of GO terms or entities annotated
with GO terms, by leveraging the ontology structure and properties and annotation corpora. In the
last decade the number and diversity of SS measures based on GO have grown considerably, and their
applications range from functional coherence evaluation to protein interaction prediction and disease gene
prioritization.

Understanding how SS measures work, what issues can affect their performance and how they compare 
to each other in different evaluation settings is crucial to gain a comprehensive view of this area and choose 
the most appropriate approaches for a given application. 

In this chapter, we provide a guide to understanding and selecting SS measures for biomedical 
researchers. We present a straightforward categorization of SS measures and describe the main strategies 
they employ. We discuss the intrinsic and external issues that affect their performance, and how these can 
be addressed. We summarize comparative assessment studies, highlighting the top measures in different 
settings, and compare different implementation strategies and their use. Finally, we discuss some of the 
extant challenges and opportunities, namely the increased semantic complexity of GO and the need for fast 
and efficient computation, pointing the way towards the future generation of SS measures. 


Key words Gene ontology, Semantic similarity, Functional similarity, Protein similarity 


1 Introduction 


The graph structure of the Gene Ontology (GO) allows the com- 
parison of GO terms and GO-annotated gene products by semantic 
similarity. Assessing similarity is crucial to expanding knowledge, 
because it allows us to categorize objects into kinds. Similar objects 
tend to behave similarly, which supports inference, a crucial task to 
support many applications including identifying protein-protein 
interactions [1], suggesting candidate genes involved in diseases [2],
and evaluating the functional coherence of gene sets [3, 4]. 
Semantic similarity (SS) assesses the likeness in meaning of two 
concepts. It has been a subject of interest to Artificial Intelligence, 
Cognitive Science, and Psychology for the last few decades, and an 
important tool for Natural Language Processing. It has been used
in this context for word sense disambiguation, determination of
discourse structure, text summarization and annotation, information
extraction and retrieval, automatic indexing, lexical selection,
and automatic correction of word errors in text [5].

Sometimes, research literature uses SS, relatedness, and distance
as interchangeable terms, but they are in fact not identical.
Semantic relatedness makes use of various relations between two
concepts (i.e., hyponymic, hypernymic, meronymic, antonymic,
and any kind of functional relations including has-part, is-made-of,
and is-an-attribute-of). SS is more limited since it usually only
makes use of hierarchical relations, such as hyponymy/hypernymy
(i.e., is-a), and synonymy. Most authors hold that semantic
distance is the opposite of similarity, but it is sometimes also used
as the opposite of semantic relatedness.

The basis for much of the earlier research in SS is WordNet,
a large lexical database of the English language, freely available
online. However, the last decade has witnessed an explosion in the
number of applications of SS to biomedical ontologies, and specifically
to the GO [6]. The GO structure provides meaningful links
between GO terms, based on the various relationships it establishes.
This structure allows us to capture the similarity between
GO terms. In general, the closer two terms are in the GO graph,
the more similar their meaning is. Moreover, we can also determine
the similarity between two GO-annotated gene products by
expanding on this notion to compare sets of GO terms. This provides
a measure of the functional similarity between two proteins,
which has numerous applications in biomedical research.

The remainder of this chapter provides an overview of SS
between GO terms and gene products annotated with GO terms,
the different kinds of approaches used in this research area, the
issues that affect their performance and evaluation, and challenges
and future directions.


2 SS Measures


An SS measure can be defined as a function that, given two ontology
terms or two sets of terms annotating two entities, returns a numer- 
ical value reflecting the closeness in meaning between them [7]. 
For a theoretical framework for SS measures please refer to [8], 
where the core elements shared by most SS measures are identified 
and a foundation for the comparison, selection, and development 
of novel measures is laid out. 

In the context of GO, SS measures can be applied to compute 
the similarity between two GO terms, term similarity, or to compute 
the similarity between two gene products each annotated with a set 
of GO terms, gene product similarity. 


In recent years there have been several categorizations of SS
measures [7, 9], and we advise readers to refer to both surveys for
a more detailed classification of SS measures and their
applications.


2.1 Term Similarity


When considering SS between concepts organized in a taxonomy, 
as is the case of GO, there are two basic approaches: internal methods 
based on ontology structure and external methods based on external 
corpora. 

The simplest structural methods calculate distance between two
nodes as the number of edges in the path between them [10]. If
there are multiple paths, the shortest path or an average of all possible
paths can be used. For instance, in Fig. 1, the distance between
heme binding and anion binding is 5. This measure depends only on
the structure of the graph and it assumes that all semantic links have
the same weight. Accordingly, SS is defined as the inverse score of
the semantic distance. This edge-counting approach is intuitive and
simple but disregards the depth of the nodes, since it considers
paths of equal length to equate to the same degree of similarity,
regardless of whether they occur near the root or deeper in the ontology.
For instance, in Fig. 1, the classes transport and binding are at a
distance of two edges, the same distance that separates iron ion
binding and copper ion binding.

Fig. 1 Subgraph of GO covering the annotations of hemoglobin subunit alpha and hemocyanin II proteins.
The number of gene products annotated to each term in GOA (January, 2016) is indicated by n

To overcome this limitation of equal distance edges, some 
approaches give edges different weights to reflect some degree of 
hierarchical depth. It is intuitive that the deeper the level in the 
taxonomy, the smaller the conceptual distance, so weights are 
reduced according to depth. Other factors can be used to deter- 
mine weights for edges such as node density and type of link. 
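
The basic edge-counting idea can be sketched in a few lines. The example below is only an illustration: the small is-a graph is a hypothetical fragment that borrows GO term names and is not a reproduction of Fig. 1, and converting distance to similarity as 1/(1 + distance) is just one possible choice of "inverse score".

```python
from collections import deque

# Hypothetical is_a edges (child -> parents); not the actual GO graph.
PARENTS = {
    "binding": ["molecular_function"],
    "ion binding": ["binding"],
    "cation binding": ["ion binding"],
    "anion binding": ["ion binding"],
    "iron ion binding": ["cation binding"],
    "copper ion binding": ["cation binding"],
    "chloride ion binding": ["anion binding"],
}


def build_adjacency(parents):
    """Treat every is_a edge as undirected so paths may go up and down."""
    adjacency = {}
    for child, term_parents in parents.items():
        for parent in term_parents:
            adjacency.setdefault(child, set()).add(parent)
            adjacency.setdefault(parent, set()).add(child)
    return adjacency


def edge_distance(term_a, term_b, parents=PARENTS):
    """Number of edges on the shortest path between two terms (BFS)."""
    adjacency = build_adjacency(parents)
    queue = deque([(term_a, 0)])
    seen = {term_a}
    while queue:
        node, distance = queue.popleft()
        if node == term_b:
            return distance
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None  # the two terms are not connected


def edge_similarity(term_a, term_b):
    """One possible inverse score: identical terms get 1.0."""
    return 1.0 / (1.0 + edge_distance(term_a, term_b))


print(edge_distance("iron ion binding", "copper ion binding"))
print(edge_distance("iron ion binding", "chloride ion binding"))
```

Because every edge counts as one step, the two sibling leaves are as far apart as any two terms separated by two edges near the root, which is exactly the limitation that depth-dependent edge weights try to address.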

However, these methods have two important limitations: they
rely heavily on the assumptions that nodes and edges in an ontology
are uniformly distributed and that nodes at the same level correspond
to the same semantic distance, both of which are untrue in the case of
GO. For instance, in Fig. 1, although oxygen binding and ion binding
are both at a depth of 2, the former is a more specific concept
and is actually a leaf node. More recent approaches attempt to mitigate
some of these issues using, for instance, the depth of the lowest
common ancestor (LCA) [11], the distance to the nearest leaf node [12],
and the depth of distinct GO subgraphs [1]. Related approaches, also
based on the structure of the ontology, combine distance metrics
with node structural properties, such as the number of subclasses and
the distance to the lowest common ancestor between the terms [13].

External methods typically make use of information-theoretic
principles. This type of approach has been shown to be less
sensitive, or not sensitive at all, to the issue of link density
variability [14], i.e., that the ontology graph may be unbalanced and
edges linking nodes may not be evenly distributed, so that the same
depth or distance may indicate different levels of specificity or
similarity. Information content (IC)-based measures are based on the
intuition that the similarity between two concepts can be given by the
extent to which they share information.

The IC of a concept c is a measure of how informative the concept
is, and it decreases as the concept becomes more likely to occur. It
can be quantified as the negative log likelihood, -log p(c), where p(c)
is the probability of occurrence of c in a specific corpus, usually
estimated by the annotation frequency in the Gene Ontology Annotation
database. A normalized version of IC was introduced in [15], whereby
IC values are expressed in a range of uniformly scaled values, making
them easier to interpret. Taking Fig. 1 again as an example, the
frequency of annotation of binding is 750,325/1,948,009, making its
IC 1.38 and its normalized IC 0.066.
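
The figures quoted for binding can be reproduced with a short calculation. The sketch below assumes a base-2 logarithm and a normalization that divides by the maximum possible IC, i.e., the IC of a term with a single annotation out of the total; both are assumptions chosen because they reproduce the values given in the text, and the function names are illustrative.

```python
import math


def information_content(n_term, n_total):
    """IC = -log2 p(c), with p(c) estimated as n_term / n_total."""
    return -math.log2(n_term / n_total)


def normalized_ic(n_term, n_total):
    """Scale IC to [0, 1] by dividing by the largest possible IC,
    that of a term annotated to a single gene product: log2(n_total)."""
    return information_content(n_term, n_total) / math.log2(n_total)


N_BINDING = 750_325    # annotations to "binding" quoted in the text
N_TOTAL = 1_948_009    # total annotations quoted in the text

print(round(information_content(N_BINDING, N_TOTAL), 2))  # 1.38
print(round(normalized_ic(N_BINDING, N_TOTAL), 3))        # 0.066
```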

When the concept of IC is applied to the common ancestors
two terms have, it can be used to quantify the information they
share and thus measure their SS. There are two main approaches
for doing this: the most informative common ancestor (MICA)
technique, in which only the common ancestor with the highest
IC is considered [14]; and the disjoint common ancestors (DCA)
technique, in which all disjoint common ancestors (the common
ancestors that do not subsume any other common ancestor) are
considered. There are several methods to compute the DCA
[16-18], which allow IC-based measures to take into account
multiple common ancestors.

Several measures have been used to measure the information
shared by two GO terms. The simplest of these measures, Resnik's,
takes the IC of the MICA as the similarity between two terms, and
was among the first to be applied to GO [19]. The MICA of chloride
ion binding and iron ion binding is ion binding, making the
Resnik similarity between these terms 0.066. Other measures
combine the IC of terms with the IC of the MICA and weight
them according to the MICA’s IC [20].

More recently, hybrid measures that combine both edge and
IC-based strategies have been proposed [21]. Corpus-independent
IC measures have also been proposed, based on the number of
descendants [22], on depth and descendants [23], and on the notion of
entropy [24].


2.2 Gene Product Similarity


Since gene products can be annotated with several GO terms 
within each of the three GO categories, gene product SS measures 
need to compare sets of terms rather than single terms. Several 
approaches have been proposed for this, most following one of two 
strategies: pairwise or groupwise. 

Pairwise approaches take the individual similarities between all 
terms annotating two gene products and combine them into a 
global measure of functional similarity. Any term similarity mea- 
sure can be applied with this strategy, where each gene product is 
represented by its set of direct annotations. Typical combination 
strategies include the average, maximum, or sum, and these can be 
applied to every pairwise combination of terms from the two sets 
or only the best-matching pair for each term. 
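
As a minimal sketch of the pairwise strategy, the example below uses Resnik's measure (the IC of the most informative common ancestor) as the term similarity and combines all term pairs by average and by maximum. The tiny ontology, the term names, and the annotation counts are hypothetical and serve only to make the arithmetic easy to follow; base-2 logarithms are assumed.

```python
import math
from itertools import product

# Hypothetical is_a hierarchy (child -> parents) and annotation counts
# (direct plus inherited); none of these values come from GO or GOA.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "B1": ["B"],
}
COUNTS = {"root": 100, "A": 50, "B": 25, "A1": 10, "A2": 5, "B1": 5}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def ic(term):
    """Information content; log2(total / count) equals -log2 p(term)."""
    return math.log2(COUNTS["root"] / COUNTS[term])


def resnik(term_a, term_b):
    """IC of the most informative common ancestor (MICA)."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(ic(term) for term in common)


def pairwise_scores(terms_a, terms_b):
    """Term similarity for every combination of terms from the two sets."""
    return [resnik(a, b) for a, b in product(terms_a, terms_b)]


def sim_avg(terms_a, terms_b):
    scores = pairwise_scores(terms_a, terms_b)
    return sum(scores) / len(scores)


def sim_max(terms_a, terms_b):
    return max(pairwise_scores(terms_a, terms_b))


gene_product_1 = ["A1", "B1"]
gene_product_2 = ["A2", "B1"]
print(sim_avg(gene_product_1, gene_product_2))
print(sim_max(gene_product_1, gene_product_2))
```

Any other term similarity function with the same two-argument signature could be substituted for `resnik` without changing the combination code.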

Groupwise approaches calculate gene product similarity directly 
by one of three approaches: set, graph, or vector. Set approaches 
consider only direct annotations and are calculated using set similar- 
ity techniques. Set-based measures are limited in that they do not 
take into account the shared ancestry between GO terms. Graph 
approaches represent gene products as the subgraphs of GO corre- 
sponding to all their annotations. Functional similarity is then calcu- 
lated either using graph-matching techniques or by less 
computationally intensive approaches such as set similarity. This 
approach takes into account all annotations (direct and inherited) 
providing a more comprehensive model of the annotations. Vector 
approaches represent gene products in vector space, with each term 
corresponding to a dimension, and functional similarity is calculated 
using vector similarity measures. Groupwise approaches can also
make use of the IC of terms, by using it to weight set similarity
computations, such as simGIC [15], which compares two sets of
terms based on an IC-weighted Jaccard similarity; as scalar values
in vectors, such as IntelliGO [25], which combines IC and the
evidence content of annotations; or to compute the IC of shared
subgraphs, such as the SS measure proposed in [14].
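
An IC-weighted Jaccard similarity of the simGIC kind can be sketched as follows: each gene product is represented by its annotations extended with all ancestors, and the score is the summed IC of the shared terms divided by the summed IC of the union. The ontology, counts, and function names below are hypothetical, base-2 logarithms are assumed, and returning 0.0 when the union carries no information is a choice made for this sketch.

```python
import math

# Hypothetical is_a hierarchy (child -> parents) and annotation counts.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "B1": ["B"],
}
COUNTS = {"root": 100, "A": 50, "B": 25, "A1": 10, "A2": 5, "B1": 5}


def with_ancestors(terms):
    """Extend a set of direct annotations with all inherited terms."""
    closure = set()
    stack = list(terms)
    while stack:
        current = stack.pop()
        if current not in closure:
            closure.add(current)
            stack.extend(PARENTS[current])
    return closure


def ic(term):
    """Information content; log2(total / count) equals -log2 p(term)."""
    return math.log2(COUNTS["root"] / COUNTS[term])


def summed_ic(terms):
    # Sorting keeps the floating-point sum independent of set ordering.
    return sum(ic(term) for term in sorted(terms))


def simgic(terms_a, terms_b):
    """Summed IC of shared terms over summed IC of all terms."""
    extended_a = with_ancestors(terms_a)
    extended_b = with_ancestors(terms_b)
    union_ic = summed_ic(extended_a | extended_b)
    if union_ic == 0:
        return 0.0  # only the root is annotated
    return summed_ic(extended_a & extended_b) / union_ic


print(simgic(["A1", "B1"], ["A2", "B1"]))
print(simgic(["A1", "B1"], ["A1", "B1"]))
```

Unlike a plain set comparison of direct annotations, the two gene products here receive credit for sharing the ancestor A even though A1 and A2 differ.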


3 Issues and Challenges in SS 


Guzzi et al. [9] have identified several issues affecting SS measures, 
which they categorize into external issues, which are usually related 
to annotation corpora, and internal issues, inherent to the design 
of the measures. They do however recognize that both kinds of 
issues can be entangled, for instance when measures make errone- 
ous assumptions about the corpora. 

The most relevant external issues are the shallow annotation 
problem, the annotation length bias, and the use of Evidence 
Codes. The shallow annotation problem stems from the fact that 
many proteins are only annotated to very general GO terms, thus 
for instance two proteins can share 100% of their terms and still be 
very dissimilar. SS measures need to account for this issue, which 
can be especially relevant in the electronic annotations. Nevertheless, 
the quality and specificity of these annotations has been increasing 
over the years [26]. 

The annotation length bias refers to the positive correlation
between SS scores and the number of annotations that some measures
produce. This is due to the fact that annotations are not uniformly
distributed among the proteins within an annotation corpus
(and also vary among different organisms' corpora), with some proteins
being very well annotated while others have a single annotation.
Both of these issues stem from incomplete annotations, which
have been shown to have a significant impact on the performance of
information-theoretic measures [27]. Finally, SS approaches need
to be aware of the impact that using electronic annotations (evidence
code IEA) can have.¹ Although in general the use of IEA
annotations has a positive or null effect on the measures' performance,
in some cases, and particularly when employing the maximum
combination approach over pairwise similarities, it can have a
detrimental effect and decrease the measure’s ability to capture
similarity as conveyed by evaluation metrics [9, 17].

There are three levels at which internal issues can occur: term
specificity, term similarity, and gene product similarity. At the term
specificity level, both typically used approaches (term depth and
IC) have their advantages and drawbacks. IC-based measures can
be affected by the corpus bias effect [29] whereby rarely used but
generic terms possess a high IC but are not biologically specific.
This issue is particularly relevant when using specific corpora that
may be incomplete. Term depth measures on the other hand,
while being independent of annotation corpora, are unable to
handle the fact that terms at the same depth rarely have the same
biological specificity, given the fact that GO’s regions have varying
node and edge density.

¹ Please see Chap. 3 [28] for more information on evidence codes.

At the term similarity level, distance-based measures suffer
from the same issues as depth-based term specificity. Moreover,
since most measures rely on the concept of common ancestors to 
measure similarity between two terms, SS measures need to define 
the set of common ancestors over which similarity is computed. 
While the most informative common ancestor (or lowest common 
ancestor in the case of edge-based measures) is commonly used 
and usually provides good performance, it has been argued that 
measures taking into account all ancestors or a selection of them 
can more adequately portray the whole gamut of function. 

At the gene product similarity level, and in particular for pairwise
measures, special care needs to be taken when choosing a combination
approach. The maximum approach is unsuitable for assessing
the global similarity of two gene products, since it focuses on the
single most similar aspect. The average approach, on the other hand,
by making an all-against-all comparison of the terms of two gene
products, produces counterintuitive results for gene products with
multiple distinct functional aspects. For instance, two gene products
both annotated with the same two unrelated terms, t1 and t2, will be
50% similar under the average approach, because similarity will be
calculated between both the matching (t1-t1, t2-t2) and the opposite
(t1-t2, t2-t1) terms of the two gene products. The best-match
approach would rely on comparing just (t1-t1, t2-t2), since these
are the best-matching term pairs in the annotation sets. The best-match
average approach generally provides better performance
by considering all terms but only the most significant matches.
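
The example of two gene products annotated with the same two unrelated terms can be checked numerically. The sketch below uses a deliberately simple, hypothetical term similarity (1 for identical terms, 0 otherwise) and one common formulation of the best-match average, in which the best matches are averaged in each direction and the two directions are then averaged.

```python
def toy_similarity(term_a, term_b):
    """Hypothetical term similarity: identical terms score 1, others 0."""
    return 1.0 if term_a == term_b else 0.0


def sim_average(sim, terms_a, terms_b):
    """Average over every pairwise combination of terms."""
    scores = [sim(a, b) for a in terms_a for b in terms_b]
    return sum(scores) / len(scores)


def sim_best_match_average(sim, terms_a, terms_b):
    """Average of the best match found for each term, in both directions."""
    best_for_a = [max(sim(a, b) for b in terms_b) for a in terms_a]
    best_for_b = [max(sim(a, b) for a in terms_a) for b in terms_b]
    mean_a = sum(best_for_a) / len(best_for_a)
    mean_b = sum(best_for_b) / len(best_for_b)
    return (mean_a + mean_b) / 2


gene_product_1 = ["t1", "t2"]
gene_product_2 = ["t1", "t2"]

print(sim_average(toy_similarity, gene_product_1, gene_product_2))             # 0.5
print(sim_best_match_average(toy_similarity, gene_product_1, gene_product_2))  # 1.0
```

The all-against-all average is pulled down by the two mismatched pairs, whereas the best-match average recognizes that the two annotation sets are identical.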


4 Evaluating and Comparing SS Measures 


Evaluating the reliability of SS measures or determining the best 
measure for each application scenario is still an open question since 
there is no gold standard. Furthermore, each of the existing mea- 
sures formalizes the notion of function similarity in slightly differ- 
ent ways and for that reason it is not possible to define what the 
best SS measure would be, since it becomes a subjective decision. 
Ultimately, SS measures attempt to capture functional similarity 
based on GO annotations, so one possible solution is to compare 
SS measures to other measures or proxies of functional similarity. 
These include sequence similarity, family similarity, protein-protein
interactions, functional modules and complexes, and expression
profile similarity. Table 1 details the best performing measures
for each aspect according to a recent survey of literature.

Table 1
Best performing SS measures according to different protein similarity measures or proxies. Sequence,
Pfam, and ECC similarity correspond to correlation evaluated using CESSM

Similarity proxy or measure    Best performing SS measures
Sequence similarity            SSDD [13], SimGIC [15], HRSS [21]
Pfam similarity                SORA [23], SSDD, SimGIC
ECC similarity                 SSDD, HRSS, SORA
Expression similarity          TCSS [1], SimGIC, SimIC, Best-Match-Avg (Resnik [15])
Protein-protein interaction    TCSS, SimIC, Max(Resnik)

Results compiled from refs. 9, 13, 21, 23, 30

Although
more classic measures of SS such as Resnik still provide top results
in some settings, it is the newer generation of measures that provides
the best results. While until recently [9] GOA-based IC measures
were regarded as the best performing measures for most
settings, the new wave of more complex structure-based measures,
such as SSDD [13], SORA [23], and TCSS [1], is now in the
lead, though closely followed by SimGIC. SSDD is based on the
concept of semantic “totipotency,” whereby terms are assigned values
according to their distance to the root and the number of
descendants for each of the levels in that path, and then similarity
corresponds to the smallest sum of “totipotencies” along a path
between two terms. SORA uses an IC based on structural information
that considers depth and number of descendants, and then
applies set similarity to gene products. TCSS divides the GO graph
into subgraphs and considers gene products more similar if they
belong to the same subgraph. We postulate that the recent success
of structural and hybrid measures is due not only to their ability to
more accurately capture the complexity of the GO graph, but also
to the evolution of GO itself, which has grown considerably
since the “classic” measures were proposed. Linear correlation to
sequence similarity is one of the most used evaluation approaches, and
in general a positive correlation between sequence similarity and SS
has been found, particularly on binned data. Nonlinear regression
analysis found that the normal cumulative distribution fits data for many
different SS measures, confirming the positive, yet nonlinear, agreement
between sequence similarity and SS [15]. Linear correlation has also
been used to compare SS to Pfam-based and Enzyme Commission
Class similarity.

One of the most relevant efforts in this area is the Collaborative 
Evaluation of Semantic Similarity Measures (CESSM) tool [30], 
which was created in 2009 to answer this need. It enables the
comparison of new GO-based SS measures against previously published
ones considering their relation to sequence, Pfam, and
Enzyme Commission Class (ECC) similarity. Since its inception, 
CESSM has been adopted by the community and used to evaluate 
several novel SS measures. 

The predictive power of SS measures in identifying protein-protein
interactions is also commonly employed in SS evaluation
[9]. In general SS measures are good predictors of PPI, but the 
most effective are groupwise or maximum combination approach 
measures. This is unsurprising given that proteins can interact 
when sharing a single functional aspect. 


There are two main kinds of available tools to compute SS mea- 
sures in GO: webservers, which typically provide easy to use solu- 
tions with fewer parametrizations possible; and software packages, 
which are more customizable, though more complex to use. 

Many of the recently proposed SS measures provide specific
webservers, but some online tools provide a wider array of measures,
such as ProteInOn [31], FunSimMat [32], or GOssToWeb
[33]. These tools rely on their own GO and GOA versions, and
though they can output similarity scores with an input of just GO
terms or UniProt accession numbers, these scores are based on the
tool’s ontology and annotation versions.

If a user needs more control over the parametrization of the
input data, then the best option is to employ a software package.
Options include R packages (e.g., GOSemSim [34]) or standalone
programs (GOssTo [33]), which give the user more freedom in terms
of ontology and annotation versions as well as in programmatic access
or the computation of SS for larger datasets. A Java library has
been recently developed for ontology-based SS calculations [35],
which includes over 50 different SS measures and accepts input
ontologies in a number of formats, including OWL, OBO, and
RDF. This library is well suited for large input datasets, being able to
run over 100 million comparisons in under 1 h. In the case of web
tools, we advise readers to check their update frequency to ensure that
recent versions of GO and the annotations are in use.


6 Challenges and Future Directions 


The last decade has witnessed a growing interest in GO-based SS, 
with dozens of new measures being proposed and applied in different 
settings. Although measures have become increasingly sophisticated,
there remain several challenges and opportunities.

GO-based SS measures are inherently dependent on GO's
development and its use in annotations. Measures should evolve
with GO, striving to provide ever more accurate metrics for gene
product functional similarity. In recent years there have been
several developments of GO which SS measures are still not exploring.
For instance, the different kinds of regulatory and occurrence
relationships, the categorization of evidence codes, logical definitions,
and internal and external cross-products can all in principle
be explored by SS approaches.

The need to provide more semantically sound measures of SS
for biomedical ontologies has been argued [36], and though GO is
commonly viewed as a DAG for a controlled vocabulary, it is actually
well axiomatized in OWL [37]. The presence of these axioms
should be considered by SS measures, and the exploration of
disjointness in SS has been recently proposed in ChEBI [38].

In general, the computational complexity of SS measures has
not been addressed. Current GO-based SS applications happen in
an offline context where computational speed is not a relevant
factor. However, for applications such as similarity-based search,
which so far are based on precomputed similarities [32], performance
should be taken into consideration. In addition, the growth
in size of biomedical datasets spurred by genomic scale studies in
the last few years also places further computational constraints on
SS measures. The challenge of handling very large datasets is
increasingly recognized, and recent implementations of SS measures
allow for parallel computation [35], but the development of
SS measures is not taking this issue into consideration a priori.

The next generation of SS measures should take into account
these two aspects: on the one hand, the possibility for increased
complexity in SS measures to provide more accurate similarity scores,
and on the other, the need for efficient SS computation. It should
strive to achieve a balance between increased accuracy and efficiency.


7 Exercises


Consider the subgraph of GO represented in Fig. 1 and the num- 
ber of annotations for each GO term it shows. 


1. Calculate the IC of the term “heme binding” considering that 
the total universe of annotations corresponds to the number of 
annotations to the root term. 


2. Transform the IC value calculated in 1 to a uniform scale [0,1]. 
Consider that the maximum IC is given to a term with a single 
annotated gene product, and an IC of zero corresponds to the 
IC of the root term, “molecular function.” 


3. Calculate the SS between the terms “chloride ion binding”
and “iron ion binding,” and between “oxygen transporter activity”
and “tetrapyrrole binding,” following the minimum edge distance
measure.


4. Calculate Resnik’s SS between the same terms as in 3.


5. Calculate the similarity between the protein hemoglobin subunit
alpha annotated with [iron ion binding, copper ion binding, protein
binding, heme binding, oxygen binding, oxygen transporter
activity], and the protein hemocyanin II annotated with [chloride
ion binding, copper ion binding, oxygen transporter activity]:


(a) Using the average of all pairwise Resnik's similarities 
(b) Using the maximum of all pairwise Resnik's similarities 


(c) Using the simGIC measure, which corresponds to the 
ratio between sum of the IC of the shared terms between 
the two proteins and the sum of the IC of the union of all 
terms between the two proteins. 


(d) Compare the obtained results with your perception of the 
actual functional similarity between the two proteins.


Gene-Category Analysis


Sebastian Bauer 


Abstract 


Gene-category analysis is one important knowledge integration approach in biomedical sciences that com- 
bines knowledge bases such as Gene Ontology with lists of genes or their products, which are often the 
result of high-throughput experiments, gained from either wet-lab or synthetic experiments. In this chapter, 
we will motivate this class of analyses and describe an often used variant that is based on Fisher’s exact test. 
We show that this approach has some problems in the context of Gene Ontology of which users should be 
aware. We then describe some more recent algorithms that try to address some of the shortcomings of the 
standard approach. 


Key words Enrichment, Overrepresentation, Knowledge integration, Fisher’s exact test, Gene
propagation problem


1 Introduction 


The result of biological high-throughput methods is often a list 
consisting of several hundred biological entities, which, in the 
case of gene expression profiling experiments, are identifiers of 
genes or their products. As a biological entity may have different 
context-specific functions, it is difficult for humans to interpret 
the outcome of an experiment on the basis of such a list. 
Computational approaches to access the biological knowledge about 
features of biological entities therefore play an important part in 
the successful realization of research based on high-throughput 
experiments. A practical way to address the question “what is going 
on?” is to perform a gene-category analysis, i.e., to ask whether 
these responder genes share some biological features that 
distinguish them among the set of all genes tested in the experiment. 

First of all, gene-category analysis involves a list of gene 
categories, in which genes with similar features are grouped 
together. The exact definition of “similar” depends on the provider 
of the categories. For instance, if Gene Ontology is the choice, 
then genes are usually grouped according to the terms to which 
they are annotated. Another scheme is the KEGG database [1], 
in which genes are grouped according to the pathways in which 
they are involved. The second ingredient is a statistical method for 
identifying the categories that are of particular interest. 

In this chapter, we introduce some commonly used approaches 
for gene-category analysis. Throughout the remainder of this chap- 
ter, we refer to the set of items, which a study could possibly select, 
as the population set. We denote this set by the uppercase letter M 
while the size of the set, or its cardinality, is identified by its lower- 
case variant m. If, for example, a microarray experiment is con- 
ducted, the population set will comprise all genes whose expression 
can be measured with the microarray chip. The actual outcome of 
the study is referred to as the study set. It is denoted by N and has 
the cardinality n. In the microarray scenario the study set could 
consist of all genes that were detected to be differentially expressed. 


2 Fisher's Exact Test 


One approach for gene-category analysis is to cast the problem 
as a statistical test. For this purpose, the study set is assumed to 
be a random sample that is obtained by drawing n items without 
replacement from the population. The population is dichotomous, 
as the items can be characterized according to whether they are 
annotated to term t or not. In particular, the set $M_t$, with 
cardinality $m_t$, constitutes all items that are annotated to t. 
Denote the random variable that describes the number of items of 
the study set that are annotated to t in this random sample as 
$X_t$. The hypergeometric distribution applies to $X_t$, and the 
probability of observing exactly k items annotated to t, i.e., 
$P(X_t = k)$, is specified by 


$$X_t \sim h(k \mid m; m_t; n) := P(X_t = k) = \frac{\binom{m_t}{k}\binom{m - m_t}{n - k}}{\binom{m}{n}}$$ 


Furthermore, the set of items that are annotated to t and members 
of the study set is denoted by $N_t$, with cardinality $n_t$. The 
objective is to assess whether the study set is enriched for term t, 
i.e., whether the observed $n_t$ is higher than one would expect. 
This forms the alternative hypothesis $H_1$ of the statistical test. 
The null hypothesis $H_0$ in this case is that there is no positive 
association between the observed occurrence of the items in the 
study set and the annotations of the items to the term t. Thus, the 
proportion of items annotated to term t is approximately identical 
for the study set and the population set. In order to be able to 
reject $H_0$ in support of $H_1$, we conduct a one-tailed test, in 
which we ask for the probability of the event that we see $n_t$ or 
more annotated items given that $H_0$ is true: 


$$p_t^{\mathrm{tft}} = P(X_t \ge n_t \mid H_0) = \sum_{k=n_t}^{\min(m_t,\, n)} \frac{\binom{m_t}{k}\binom{m - m_t}{n - k}}{\binom{m}{n}} \qquad (1)$$ 


If the probability obtained by this equation¹ is below a certain 
significance level α, e.g., α = 0.05, we reject $H_0$ in favor of 
$H_1$. In that case, the tested term t is regarded as an interesting 
term that contributes to the characterization of the study set. 


Example 2.1. Suppose that we are given a population of m = 18 
genes, of which $m_t = 4$ genes are annotated to a term t. The 
outcome of an experiment yields a study set of n = 5 differentially 
expressed genes. A total of $n_t = 3$ genes of the study set are 
annotated to t. Figure 1 illustrates the participating sets and how 
they are related to one another in that particular situation. 

In order to check whether term t can be used to characterize the 
experiment, we ask whether term t is overrepresented in the study set. 
The application of Eq. 1 yields a p-value for t 


$$p_t^{\mathrm{tft}} = P(X_t \ge 3 \mid H_0) = \frac{\binom{4}{3}\binom{14}{2}}{\binom{18}{5}} + \frac{\binom{4}{4}\binom{14}{1}}{\binom{18}{5}} = 0.044.$$ 


Thus, the null hypothesis is rejected and the term is said to be 
overrepresented among the differentially expressed genes and is thus 
likely to reflect an association between the term and the experiment. 
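
The arithmetic of Example 2.1 can be checked with a few lines of code. The following is a minimal sketch in Python that evaluates the upper tail of the hypergeometric distribution as in Eq. 1; the function names are chosen here for illustration and do not come from any of the software packages discussed in this chapter.

```python
from math import comb


def hypergeom_pmf(k, m, m_t, n):
    """P(X_t = k): k of the n drawn items are annotated, with m_t annotated among m."""
    return comb(m_t, k) * comb(m - m_t, n - k) / comb(m, n)


def term_for_term_pvalue(m, m_t, n, n_t):
    """One-tailed p-value P(X_t >= n_t | H_0), summing the upper tail as in Eq. 1."""
    upper = min(m_t, n)
    return sum(hypergeom_pmf(k, m, m_t, n) for k in range(n_t, upper + 1))


# Numbers from Example 2.1: m = 18, m_t = 4, n = 5, n_t = 3.
p_t = term_for_term_pvalue(m=18, m_t=4, n=5, n_t=3)
print(round(p_t, 3))  # 0.044
```

In R, the same tail probability is what the exercises at the end of this chapter suggest computing with dhyper and phyper.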


3 Multiple Testing Problem 


In hypothesis-generating studies it is not clear a priori which terms 
should be tested. Therefore, the procedure is not conducted for a 
single term only but is applied to many, often all, terms that 
Gene Ontology provides and to which at least one gene is annotated. 
The result of the entire analysis is then a list of terms that 
were found to be significant. This, however, implies that the 
number of false-positive terms is high. 


¹ The superscript tft in $p_t^{\mathrm{tft}}$ stands for term-for-term. It 
distinguishes this p-value from other measures that are described later. 


Fig. 1 Sets and their relations in the standard approach. In this example the 
population consists of m = 18 genes and n = 5 of them are part of the study set. 
Exactly $m_t = 4$ genes of the population are annotated to term t. This term has 
$n_t = 3$ genes in common with the study set. The null hypothesis of the standard 
approach (term-for-term) is that there is no association between the number of 
genes that are in the study set and the number of genes that are annotated to the 
term t, i.e., the study set is a random sample of the population set. We therefore 
would expect that it contains the same proportion of genes annotated to t as the 
population set does. The probability under the null hypothesis of the event to see 
at least $n_t$ genes can be assessed via Eq. 1. 


To see this, suppose that there are T tests to be performed. We 
assume that the null hypothesis is true for all of those tests. Before 
its actual determination, any p-value can be considered as a random 
variable as well, for which $P(p \le \alpha \mid H_0) \le \alpha$ holds [2]. This 
implies that it can be expected that α × T tests lead to the rejection 
of a null hypothesis although it is true. 


Example 3.1. If there are 10,000 null hypotheses that are true and 
all of them are tested at α = 0.05, then we expect that we reject the 
null hypotheses for about 500 tests. Obviously, describing the result 
of an experiment with 500 random terms is not useful. 

Therefore, the result of a term enrichment analysis should be 
further subjected to a multiple test correction. The simplest is 
the Bonferroni correction [3]. Here, each p-value is simply 
multiplied by the number of tests, with the result capped at 1.0. 
Bonferroni controls the so-called family-wise error rate, which is 
the probability of making one or more false discoveries. It is a very 
conservative approach because it handles all p-values as independent. 
But as we see later, independence is not typical in gene-category 
analysis, so this approach often comes with reduced statistical power. 
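
The effect described in Example 3.1 and the Bonferroni adjustment can be illustrated with a small simulation. The sketch below is in Python and rests on one assumption: under a true null hypothesis a p-value is treated as uniformly distributed on [0, 1], which is only approximately the case for a discrete test such as Fisher's exact test. For comparison it also includes the Benjamini-Hochberg step-up adjustment [5], an FDR-controlling alternative that this section also covers. The function names are illustrative only.

```python
import random


def bonferroni(pvalues):
    """Multiply each p-value by the number of tests and cap the result at 1.0."""
    t = len(pvalues)
    return [min(1.0, p * t) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Step-up FDR adjustment: p_(i) * T / i, made monotone from the largest rank down."""
    t = len(pvalues)
    order = sorted(range(t), key=lambda i: pvalues[i])
    adjusted = [0.0] * t
    running_min = 1.0
    for rank in range(t, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * t / rank)
        adjusted[idx] = running_min
    return adjusted


# Assumption: 10,000 tests whose null hypotheses are all true (uniform p-values).
rng = random.Random(0)
null_pvalues = [rng.random() for _ in range(10_000)]
alpha = 0.05

raw_hits = sum(p < alpha for p in null_pvalues)
bonf_hits = sum(p < alpha for p in bonferroni(null_pvalues))
print(raw_hits)   # close to alpha * T = 500
print(bonf_hits)  # far fewer, typically none
```

Without correction roughly 5% of the true null hypotheses are rejected, whereas after Bonferroni adjustment a rejection requires a raw p-value below α / T.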


In contrast, the Westfall-Young procedure [4] takes dependencies 
into account. This correction, however, is computationally more 
costly as it is based on resampling schemes. In particular, in the 
gene-category setting, this scheme involves randomly sampling 
study sets of the same size as the original study set from the 
population. Each set is subjected to the test procedure, yielding a 
set of p-values for each term, also referred to as the null 
distribution of that term. By relating the original p-value to the 
null distribution, an adjusted p-value is derived. There are other 
types of multiple test corrections that do not aim to control the 
family-wise error rate. For instance, the Benjamini-Hochberg 
approach [5] controls the false discovery rate (FDR), which is the 
expected proportion of false discoveries among all rejected null 
hypotheses. This has a positive effect on the statistical power at 
the expense of having less strict control over false discoveries. 
Controlling the FDR is considered by the American Physiological 
Society as “the best practical solution to the problem of multiple 
comparisons” [6]. 

Note that less conservative corrections usually yield a larger 
number of significant terms, which may not be desirable after all. 
In the following section, we further explore the structural origin of 
the correlations of the p-values in the setting of enrichment tests 
for ontology terms. 


4 Gene Propagation 


While the application of multiple testing correction aims to reduce 
the number of false-positives in a rather universal manner, one can 
also try to tackle the problem at a more basic level. The root of the 
problem is that if a term shares genes with a second term, and one 
of the terms is overrepresented, then it is not too surprising that 
the other term is also detected as overrepresented. 

That the gene sharing of terms of an ontology is more the rule 
than the exception can be deduced from the principles of how 
ontologies are designed. Within an ontology, terms describe 
concepts of a domain that can be related to other terms by various 
types of relationships. The most prominent relationship is the 
“is a” relationship, which effectively propagates the membership 
of the subject (source) of the relationship to the object 
(destination). That means, if a term $T_1$ is related to a term 
$T_2$ by the “is a” relationship, and a gene is annotated to $T_1$, 
then it is implicitly also annotated to term $T_2$ (see Chap. 1 [7]). 
In the context of GO overrepresentation analysis, we refer to this 
as the gene propagation problem.² 
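
The implicit annotation can be made explicit by walking from each directly annotated term up to the root. The following Python sketch assumes a hypothetical ontology given as a dictionary that maps each term to its parents; the term and gene labels are invented for illustration and are not real GO identifiers.

```python
def propagate_annotations(direct, parents):
    """Return, for every term, the genes annotated to it directly or via a descendant.

    direct  -- dict: term -> set of genes annotated directly to that term
    parents -- dict: term -> list of parent terms ("is a" relations)
    """
    propagated = {term: set() for term in parents}
    for term, genes in direct.items():
        stack = [term]
        seen = set()
        while stack:  # walk upward from the term to the root
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            propagated.setdefault(current, set()).update(genes)
            stack.extend(parents.get(current, []))
    return propagated


# Hypothetical three-term chain: t "is a" s, s "is a" r (r is the root).
parents = {"r": [], "s": ["r"], "t": ["s"]}
direct = {"t": {"g1", "g2"}, "s": {"g3"}, "r": {"g4"}}
implicit = propagate_annotations(direct, parents)
print(sorted(implicit["s"]))  # ['g1', 'g2', 'g3']
```

Every gene of the child t reappears in the gene sets of s and r, which is exactly why the test results of related terms are correlated.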


² Note that in addition to this gene sharing that is due to the graph structure 
of the ontology, unrelated terms can also be annotated to similar sets of genes, 
for instance, if the same gene plays a role in distinct biological processes. 


[Figure 2: term t is a child of term s, which is a child of the root r] 
Term t: $m_t = 4$, $n_t = 3$, $p_t = 0.044$ 
Term s: $m_s = 6$, $n_s = 4$, $p_s = 0.022$ 
Term r: m = 18, n = 5 


Fig. 2 Extended example with three terms. This depicts the situation of 
Example 2.1 with two more terms. Term t “is a” s and therefore s is a parent of t. 
Term r is the root of the ontology. It is the only parent of s. As indicated in the last 
row, the procedure based on Fisher's exact test determines a p-value below 0.05 
for both terms. Thus, both terms will be considered as a meaningful summary of 
the underlying experiment. 


Example 4.1 (Continuation of Example 2.1). There is another 
term s, which is the only parent of t. For s we know that $m_s = 6$ and 
$n_s = 4$. Figure 2 shows this structure graphically. There, it is also 
indicated that the p-values of terms t and s are 0.044 and 0.022, 
respectively, which means that both terms are considered as 
significant at a significance level of α = 0.05 if no multiple test 
correction is performed. Obviously, both terms share the majority of 
items that are also part of the study set. One can argue that the fact 
that term t is identified as overrepresented is a consequence of the 
fact that s is overrepresented. 

A simple synthetic experiment, in which a term is artificially 
overrepresented, demonstrates the extent of the problem. 
Let us select the term localization for this purpose. We create a 
study set in which each gene that is annotated to that term is 
included with probability 0.8. This corresponds to a false-negative 
rate of β = 0.2. Furthermore, to introduce some background noise, 
each gene that is not annotated to the term is added to the study 
set with a false-positive rate of α = 0.1. In this example, the 
procedure yields a set of 1542 genes. For each considered term, this 
set is subjected to Fisher's exact test, resulting in a list of 4549 
p-values.³ Finally, the p-values are adjusted using the Bonferroni 
correction. 
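
The construction of such a synthetic study set can be sketched as follows. This Python example is only an illustration of the sampling step: the gene identifiers and the sizes of the population and of the term are hypothetical and do not correspond to the annotation data used for the figures reported in the text.

```python
import random


def sample_study_set(population, term_genes, alpha=0.1, beta=0.2, seed=0):
    """Draw a noisy study set for one artificially overrepresented term.

    Genes annotated to the term enter with probability 1 - beta,
    all other genes enter with probability alpha (background noise).
    """
    rng = random.Random(seed)
    study = set()
    for gene in sorted(population):  # sorted so that the draw is reproducible
        keep_prob = (1.0 - beta) if gene in term_genes else alpha
        if rng.random() < keep_prob:
            study.add(gene)
    return study


# Hypothetical population of 6000 gene identifiers, 1000 annotated to the term.
population = {f"gene{i}" for i in range(6000)}
term_genes = {f"gene{i}" for i in range(1000)}
study = sample_study_set(population, term_genes)
print(len(study))  # expected around 0.8 * 1000 + 0.1 * 5000 = 1300
```

The resulting set would then be passed, term by term, to the test of Eq. 1 and the p-values adjusted as described above.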

The analysis correctly identifies the term localization as 
significantly enriched. In addition to that, it identifies 275 other 
terms as significantly enriched. In particular, 6 of the 6 children 
to which at least one gene is annotated are significant. Among the 
681 possible descendants of localization, we find 172 significant 
ones. These figures suggest that descendants come up only because 
their annotations converge in the term localization. Although, in 
the statistical sense, this is a correct result, it is not desirable 
to use such a large number of terms to characterize the study set, 
especially as it is sufficient to use the term localization for this 
purpose. What is more, the result suggests a specificity that we did 
not put in there. It makes sense to consider each of the additional 
275 significant terms as a false-positive, and in the next sections 
we will briefly describe methods that attempt to reduce that number. 


³ This corresponds to the number of terms from the biological process 
subontology to which at least one gene is annotated. 


5 Parent-Child Approach 


The parent-child approach [8] is still based on Fisher’s exact test, 
but the probability of t being overrepresented is conditioned on 
properties of the parental terms. In the following, let pa(t) be the 
set of parents of term t, which are, for instance, those terms to 
which t is connected by an “is a” relation. In order to introduce the 
principal ideas of the parent-child approaches, we initially assume 
that there is only a single parent of t, i.e., pa(t) = {s}. 

Instead of drawing the items from the population M, items will 
be drawn just from the set of items that are annotated to the parent 
of t, which is written as $M_{pa(t)}$ and whose size is $m_{pa(t)}$. 
This consideration yields the following equation: 


$$P(X_t = k \mid pa(t)) = \frac{\binom{m_t}{k}\binom{m_{pa(t)} - m_t}{n_{pa(t)} - k}}{\binom{m_{pa(t)}}{n_{pa(t)}}} \qquad (2)$$ 

Here, $n_{pa(t)}$ denotes the number of items of the study set that are 
annotated to the parent of t. 

The right part of Fig. 3 shows the setting of the parent-child 
approaches. Effectively, in the parent-child approaches, we change 
the population that underlies Fisher’s exact test to the items 
annotated to the parents. Obviously, this also alters the involved 
sets for the study set. As previously, we ask for the probability of 
seeing the observed number of items or a more extreme event: 


$$p_t^{\mathrm{pc}} = P(X_t \ge n_t \mid H_0) = \sum_{k=n_t}^{\min(m_t,\, n_{pa(t)})} \frac{\binom{m_t}{k}\binom{m_{pa(t)} - m_t}{n_{pa(t)} - k}}{\binom{m_{pa(t)}}{n_{pa(t)}}} \qquad (3)$$ 


Example 5.1 (Continuation of Example 4.1). As shown in Fig. 2, 
the parent of term s is the root r of the ontology, to which all genes 
of the population are always annotated. Therefore, the p-value for s 
is the same for the previous approach and for the parent-child 
approach, i.e., $p_s^{\mathrm{pc}} = p_s^{\mathrm{tft}} = 0.022$. However, for term t, 
Eq. 3 yields 


$$p_t^{\mathrm{pc}} = P(X_t \ge n_t \mid H_0) = \frac{\binom{4}{3}\binom{2}{1}}{\binom{6}{4}} + \frac{\binom{4}{4}\binom{2}{0}}{\binom{6}{4}} = 0.6.$$ 


Thus, the null hypothesis for term t is not rejected, which is in contrast 
to the result of the previous approach. Given the initial observation 
that the study set is already skewed toward the parent s of t, the 
enrichment of term t is less surprising, which the parent-child 
approaches reflect by returning a higher p-value. 


Fig. 3 Sets and their relations in the parent-child approaches. Part (a) depicts the model of the term-for-term 
approach as it was shown in Fig. 1. This is contrasted in part (b) with the model of the parent-child approaches. 
In this approach, we shift the focus to a smaller set of genes, for instance to the genes that are annotated to 
at least one of the parents of term t. In this particular situation it is the set whose size is $m_{pa(t)} = 6$ with 
pa(t) = {s} following Example 4.1. Genes that are not part of this set do not contribute to the calculation. This 
has an effect on the involved proportions, and thus on the outcome of the test. Effectively, for each term, we 
alter the population of the association test. Eq. 2 quantifies the probability. 

If term t has more than one parent term, then it is not 
immediately apparent how to calculate $m_{pa(t)}$ and the observation 
$n_{pa(t)}$ in Eqs. 2 and 3. In Grossmann et al. [8] we examined two 
variants in detail, the union and the intersection of genes that are 
annotated to each of the parents. 
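
The parent-child test of Eq. 3, including the union and intersection variants for several parents, can be sketched in Python as follows. The gene labels are hypothetical and merely reproduce the counts of Examples 2.1 and 4.1; the function names are illustrative and are not taken from the Ontologizer or any other implementation.

```python
from math import comb


def upper_tail(m, m_t, n, n_t):
    """P(X >= n_t) for a hypergeometric draw of n items from m, m_t of them annotated."""
    return sum(
        comb(m_t, k) * comb(m - m_t, n - k) / comb(m, n)
        for k in range(n_t, min(m_t, n) + 1)
    )


def parent_child_pvalue(term_genes, parent_gene_sets, study, mode="union"):
    """Eq. 3 with the population restricted to genes annotated to the parents of t."""
    if mode == "union":
        parent_pop = set().union(*parent_gene_sets)
    else:  # "intersection"
        parent_pop = set.intersection(*map(set, parent_gene_sets))
    parent_study = study & parent_pop
    m_t = len(term_genes & parent_pop)
    n_t = len(term_genes & parent_study)
    return upper_tail(len(parent_pop), m_t, len(parent_study), n_t)


# Hypothetical gene labels reproducing the counts of Examples 2.1 and 4.1.
population = {f"g{i}" for i in range(18)}
genes_t = {"g0", "g1", "g2", "g3"}            # m_t = 4
genes_s = genes_t | {"g4", "g5"}              # m_s = 6
study = {"g0", "g1", "g2", "g4", "g6"}        # n = 5, n_t = 3, n_s = 4

print(round(parent_child_pvalue(genes_t, [genes_s], study), 3))     # 0.6
print(round(parent_child_pvalue(genes_s, [population], study), 3))  # 0.022
```

With a single parent the two modes coincide; they differ only when a term has several parents whose gene sets do not agree.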


6 Topology-Based Algorithms 


Alexa et al. devised another method to address the gene 
propagation problem. The authors propose calculating a score for the 
term that depends on the relevance of the children of the term [9]. 
They argue that capturing the meaning in that way is biologically 
more interesting as the definitions of children are more specific. 
Following this argumentation, the authors formulated two concrete 
algorithms that try to provide a more suitable, i.e., less correlated, 
distribution of terms that get flagged as important. While the first 
approach, which they called the elim algorithm, strictly favors 
significance of the most specific levels of the GO graph, their second 
algorithm, called weight, relaxes this restriction such that terms that 
are most significant are favored. 

As before, we understand the top of the graph as the root of 
the ontology, while the bottom of the graph consists of the most 
specific terms. The idea of the elim algorithm is to traverse the 
graph representation of the ontology in bottom-up fashion, which, 
for instance, can be accomplished by utilizing the backtrack phase 
of a depth-first search (DFS) [10]. 

The elim procedure takes a term t as a parameter and returns a 
set of flagged genes. On its initial invocation, it begins with the 
root of the ontology. For the current term t, we apply Fisher's exact 
test in order to relate the genes of the study set to the genes of the 
population with respect to the genes that are annotated to term t. 
As in the parent-child approaches, not all genes of the study set 
contribute to the calculation. For elim, a set of previously 
determined genes is subtracted from the study set before the 
calculation for $p_t$ is carried out. This set is constructed by 
recursively applying the elim procedure to all children of t and 
taking the union of the results. If $p_t$ is significant, we add all 
genes of t to the set of flagged genes. Finally, we return the set of 
flagged genes to the caller. Note that when the DFS reaches a leaf 
node of the ontology, Fisher’s exact test is performed exactly as in 
the standard approach. 

Obviously, the complexity of the algorithm is the same as the 
complexity of a depth-first search if we assume that the number of 
genes that are annotated to a term is constant. Note that in the 
original publication of elim, the algorithm was based on an 
iteration over the levels of the GO DAG, which partitions the 
nodes according to their longest distance to the root. The 
algorithm as outlined here yields an equivalent result without the 
need to explicitly keep track of the DAG levels. 
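
The recursive formulation can be sketched in Python as shown below. This is an illustration under stated assumptions, not the implementation of Alexa et al.: flagged genes are treated as no longer annotated to the ancestors while the population and study set sizes stay unchanged, significance is judged against a fixed α without multiple testing correction, and terms with several parents are simply revisited rather than memoized. The three-term ontology and gene labels are hypothetical and mirror the counts of Fig. 2.

```python
from math import comb


def upper_tail(m, m_t, n, n_t):
    """P(X >= n_t) under the hypergeometric null (Eq. 1)."""
    return sum(
        comb(m_t, k) * comb(m - m_t, n - k) / comb(m, n)
        for k in range(n_t, min(m_t, n) + 1)
    )


def elim(term, children, annotations, population, study, alpha, pvalues):
    """Bottom-up elim pass; returns the genes flagged at `term` or below it."""
    flagged = set()
    for child in children.get(term, []):
        flagged |= elim(child, children, annotations, population, study, alpha, pvalues)
    # Genes flagged by significant descendants are treated as not annotated to `term`.
    genes = annotations[term] - flagged
    pvalues[term] = upper_tail(len(population), len(genes), len(study), len(genes & study))
    if pvalues[term] < alpha:
        flagged |= annotations[term]
    return flagged


# Hypothetical toy ontology r -> s -> t with the counts of Fig. 2.
population = {f"g{i}" for i in range(18)}
genes_t = {"g0", "g1", "g2", "g3"}
genes_s = genes_t | {"g4", "g5"}
study = {"g0", "g1", "g2", "g4", "g6"}
children = {"r": ["s"], "s": ["t"]}
annotations = {"r": population, "s": genes_s, "t": genes_t}

pvalues = {}
elim("r", children, annotations, population, study, alpha=0.05, pvalues=pvalues)
print(round(pvalues["t"], 3))  # 0.044
print(round(pvalues["s"], 2))  # 0.49
```

Because term t is significant on its own, its four genes no longer support the parent s, whose p-value rises accordingly.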


Example 6.1 (Continuation of Example 5.1). The p-value of 
term t matches the p-value of term t of the standard approach, i.e., 
$p_t^{\mathrm{elim}} = p_t^{\mathrm{tft}} = 0.044$. As this is a significant result, at least if 
correction for multiple testing is omitted, all four genes that are 
annotated to t are removed in the consideration of upper terms, i.e., 
we assume that those four genes are not annotated to them. This leaves 
two genes for the computation of term s, of which only one is a member 
of the study set (Fig. 3b). With $m_s = 2$, $n_s = 1$, and the rest as 
before, Eq. 1 yields 


$$p_s^{\mathrm{elim}} = P(X_s \ge 1 \mid H_0) = \frac{\binom{2}{1}\binom{16}{4}}{\binom{18}{5}} + \frac{\binom{2}{2}\binom{16}{3}}{\binom{18}{5}} = 0.49.$$ 


Hence, the elim method doesn’t report term s as important. 

An equivalent characterization of the elim method is the 
following: If a term t is identified as significant, all genes that 
are annotated to t are no longer considered in the computation of the 
relevance of the ancestors of t. As was discussed in Example 4.1 
and as can also be seen in Fig. 2, the term-for-term approach assigns 
term s a lower p-value than it does for term t. One may conclude that 
it is more appropriate to take term s than to take term t in order to 
provide a compact description of the study set. However, in 
Example 6.1 we saw that the application of the elim method results in 
usage of term t to describe the outcome, which is contrary to that 
conclusion. 

This concern is addressed by the weight method. It compares the 
significance scores of a family of terms (a parent and its children) 
to identify the locally most significant terms and down-weights 
genes in less significant neighbors. This effectively decorrelates 
the p-values of the related terms such that their differences are 
amplified while the most significant terms are still retained. 


7 Model-Based Approaches 


The previously described procedures that address the gene 
propagation problem have in common that they successively test 
overrepresentation for each of the terms. They all use some form of 
Fisher’s exact test. In contrast to this, model-based gene set 
analysis (MGSA) models the gene response in a genome-wide 
experiment as the result of an activation of a number of terms [11].⁴ 

The approach is based on a model that can nicely be expressed 
using a Bayesian network with three layers of Boolean random 
variables. The term layer consists of m Boolean nodes corresponding 
to the m terms of the ontology. A term can be active or inactive. 
A parameter p, usually much less than 0.5, represents the prior 
probability of a term being active. The hidden layer contains n 
Boolean nodes representing the hidden states of the n genes. The 
hidden state of a gene is a consequence of the states of the terms to 
which the gene is annotated: The gene is on if and only if at least 
one term to which the gene is annotated is active, otherwise it is 
off. The third layer, the observed layer, contains Boolean nodes 
reflecting the experimentally observed state of all genes. For 
instance, in the setting of a microarray experiment, the on state 
would correspond to differential expression, and the off state 
would correspond to a lack of differential expression of a gene. The 
observed gene state depends on the corresponding hidden gene 
state in a one-to-one fashion with false-positive (α) and 
false-negative (β) rates that are identical and independent for all 
genes. A simple instance of the model is depicted in Fig. 4. 


⁴ We use the word term here because we primarily work with GO, but the 
method can be applied to any other structured or unstructured vocabulary. 


Fig. 4 The graphical representation of an MGSA network. An example structure for 
four terms and three genes with a possible realization is displayed. Terms ($T_i$) 
that constitute the first layer can be either active (light) or inactive (dark). Terms 
that are active enable the hidden state ($H_j$) of all genes annotated to them, the 
other genes remaining off. The observed states ($O_j$) of the genes are noisy 
observations of their true hidden state. In this example, the observed states for 
genes 1 and 3 match the hidden state while for some unknown reasons the 
measurement of gene 2 doesn't correspond to the hidden state. It's a false-negative. 
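
For a network small enough to enumerate, the three-layer model can be evaluated exactly, which may help to see how the layers interact. The Python sketch below computes, for every term, the posterior probability of being active given the observed gene states by summing over all term configurations. It is a toy illustration under simplifying assumptions: the parameters p, α, and β are fixed by hand, the term and gene labels are hypothetical, and the enumeration is exponential in the number of terms, so it is not how the mgsa package works.

```python
from itertools import product


def term_marginals(term_genes, observed, p=0.1, alpha=0.1, beta=0.2):
    """Exact posterior P(term active | observations) by enumerating all term states.

    term_genes -- dict: term -> set of genes annotated to it
    observed   -- dict: gene -> True (on) / False (off)
    """
    terms = sorted(term_genes)
    weights = {t: 0.0 for t in terms}
    total = 0.0
    for states in product([False, True], repeat=len(terms)):
        active = [t for t, on in zip(terms, states) if on]
        # Prior: each term is active independently with probability p.
        w = p ** len(active) * (1 - p) ** (len(terms) - len(active))
        # Hidden layer: a gene is on iff at least one of its terms is active.
        hidden_on = set().union(*(term_genes[t] for t in active))
        for gene, obs in observed.items():
            if gene in hidden_on:
                w *= (1 - beta) if obs else beta    # observed off: false negative
            else:
                w *= alpha if obs else (1 - alpha)  # observed on: false positive
        total += w
        for t in active:
            weights[t] += w
    return {t: weights[t] / total for t in terms}


# Hypothetical example: terms A and B share the same genes, term C does not.
term_genes = {"A": {"g1", "g2", "g3"}, "B": {"g1", "g2", "g3"}, "C": {"g4", "g5"}}
observed = {"g1": True, "g2": True, "g3": True, "g4": False, "g5": False, "g6": False}
marginals = term_marginals(term_genes, observed)
print({t: round(v, 2) for t, v in marginals.items()})
```

In this toy case the two terms with identical gene sets receive the same posterior probability, close to one half each, while the term whose genes were not observed as on receives almost none.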


The model describes how the activity of terms leads to the 
observed states of genes. This, however, is not the direction we are 
interested in. We are interested in the set of terms that explains 
the experimentally obtained data best, and the mathematical tool 
that can be applied to find such sets is probabilistic inference. The 
optimization problem of finding the term state configuration that 
explains the observed gene pattern best is NP-hard [12]. 
However, it is easily possible to find good approximate solutions by 
sampling from the state space. This procedure additionally allows 
one to determine the so-called marginal probability for each term, 
which is a measure of how well the particular term explains the 
observed genes with respect to all the other terms. The value ranges 
between 0 and 1 with 0 being the lowest possible support and 1 
being the best possible support for a term. As all terms compete 
with one another, the inference takes dependencies both due to 
gene propagation and due to similarity of annotations into 
account. For example, if two unrelated terms are annotated to 
the same set of genes that matches the observation, the marginal 
probability for both terms will be close to 0.5. Consequently, it is 
advisable to run MGSA for each of the subontologies separately as 
they are designed to express orthogonal features. 


8 Gene Set Enrichment Analysis 


In addition to approaches that take a fixed subset of the population 
as input, procedures that take the measurements of the genes into 
account are also widely in use. This is attractive as it frees the inves- 
tigator from the need to define a sometimes arbitrary cutoff that is 
used to construct the study set. 

A first version of the so-called Gene Set Enrichment Analysis 
(GSEA) that received much attention from the scientific community 
was published by Mootha et al. [13]. In this approach, genes are 
ranked according to an interesting feature (e.g., the difference of 
the mean of their expression values for two experimental 
conditions). The null hypothesis is that the genes of the interesting 
set (e.g., genes annotated to a term) have no association with that 
list, in which case they would be randomly ordered. The alternative 
hypothesis is that the genes of the interesting set have an 
association. For instance, if the genes of the set are grouped 
together at the top of the list, we would tend to believe that there 
is such an association. 

To capture the association via statistical means, the authors 
proposed a normalized Kolmogorov-Smirnov (KS) test statistic. 
Let $r_i \in M$ be the gene of the population M that has rank i in the 
gene list that is sorted according to the interesting gene feature. 
Using the previously established notation, i.e., that m is the total 
number of genes and $N_t$ is the set of cardinality $n_t$ that contains 
only genes that are annotated to t, the score is defined as: 


$$ES(N_t) = \max_{1 \le i \le m} \sum_{j=1}^{i} \begin{cases} \sqrt{\dfrac{m - n_t}{n_t}} & \text{if } r_j \in N_t \\[2ex] -\sqrt{\dfrac{n_t}{m - n_t}} & \text{otherwise} \end{cases}$$ 


Thus, the score is the maximum of a running sum that is increased 
if the gene is annotated to t and decreased if the gene is not 
annotated to t. In order to check if the obtained score is 
significant, the calculation is repeated for k randomly chosen sets 
$N^1, \ldots, N^k$, which all are subsets of M with size $n_t$. The 
p-value for a term t is calculated as 


$$p_t = \frac{\left|\{\, i : ES(N^i) \ge ES(N_t) \,\}\right|}{k}$$ 


The GSEA method underwent a slight revision by Subramanian et al. 
[14], in which ad hoc modifications were implemented that are 
supposed to counteract the well-known lack of sensitivity of the KS 
test [15, 16]. 
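
The running-sum score and its permutation p-value can be sketched in Python as follows. The increments follow the normalized KS statistic of Mootha et al. [13], not the weighted revision of Subramanian et al. [14]; the gene labels and the number of random sets are hypothetical, and the sketch assumes that the gene set is neither empty nor equal to the whole list.

```python
import random
from math import sqrt


def enrichment_score(ranked_genes, gene_set):
    """Maximum of the running sum over the ranked list (normalized KS-like statistic)."""
    m = len(ranked_genes)
    n_t = sum(gene in gene_set for gene in ranked_genes)
    hit = sqrt((m - n_t) / n_t)    # added when the gene is annotated to t
    miss = sqrt(n_t / (m - n_t))   # subtracted otherwise
    running, best = 0.0, float("-inf")
    for gene in ranked_genes:
        running += hit if gene in gene_set else -miss
        best = max(best, running)
    return best


def permutation_pvalue(ranked_genes, gene_set, k=1000, seed=0):
    """Fraction of k random gene sets of equal size scoring at least as high."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    size = sum(gene in gene_set for gene in ranked_genes)
    hits = 0
    for _ in range(k):
        random_set = set(rng.sample(ranked_genes, size))
        if enrichment_score(ranked_genes, random_set) >= observed:
            hits += 1
    return hits / k


# Hypothetical ranked list of 100 genes; the gene set occupies the top ten ranks.
ranked = [f"g{i}" for i in range(100)]
top_set = {f"g{i}" for i in range(10)}
print(enrichment_score(ranked, top_set))          # ten hits of sqrt(90 / 10) each
print(permutation_pvalue(ranked, top_set, k=200))
```

A gene set concentrated at the top of the list reaches the largest possible score, so essentially no random set of the same size matches it, whereas a set spread evenly over the list yields a p-value close to 1.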


9 Software 


Gene-category analysis is a very prominent use case of Gene 
Ontology. It shouldn’t come as a surprise that users can choose 
among a variety of software implementations that will perform this 
sort of analysis. For instance, the current version of the website of 
the Gene Ontology Consortium (geneontology.org) provides access 
to the basic Fisher’s exact test directly on the front page. There 
are also graphical tools that integrate into existing frameworks 
such as BiNGO [17], standalone graphical clients such as 
Ontologizer [18], or packages for Bioconductor such as topGO 
[19], mgsa [20], or gCMAP [21], just to name a few of them. 


10 Exercises 


1. Repeat the random experiment outlined in the text that was 
used to show the influence of the gene propagation. When 
doing this in R/Bioconductor, it is advisable to use the GO.db 
and org.Sc.sgd.db packages that provide the structure and the 
annotations. The calculation involving the hypergeometric 
distribution can be expressed directly in R using dhyper and 
phyper. Now repeat this experiment with other approaches based 
on study sets that were outlined in this chapter and compare 
the results. For the topology-based algorithms the topGO 
package can be used and for the model-based approach the mgsa 
package is well suited. 


2. Now apply the approaches to an arbitrary example or to 
real-world data. Compare the results. 


Chapter 14 


Gene Ontology: Pitfalls, Biases, and Remedies 


Pascale Gaudet and Christophe Dessimoz 


Abstract 


The Gene Ontology (GO) is a formidable resource, but there are several considerations about it that are 
essential to understand the data and interpret it correctly. The GO is sufficiently simple that it can be used 
without deep understanding of its structure or how it is developed, which is both a strength and a weakness. 
In this chapter, we discuss some common misinterpretations of the ontology and the annotations. A better 
understanding of the pitfalls and the biases in the GO should help users make the most of this very rich 
resource. We also review some of the misconceptions and misleading assumptions commonly made about 
GO, including the effect of data incompleteness, the importance of annotation qualifiers, and the transitivity 
or lack thereof associated with different ontology relations. We also discuss several biases that can confound 
aggregate analyses such as gene enrichment analyses. For each of these pitfalls and biases, we suggest 
remedies and best practices. 


Key words Gene ontology, Gene/protein annotation, Data mining, Bias, Confounding, Simpson’s 
paradox 


1 Introduction 


As we have seen in previous chapters (for example, refer to Chap. 1 
[1], Chap. 12 [2], Chap. 13 [3]), by providing a large amount of 
structured information, the Gene Ontology (GO) greatly facili- 
tates large-scale analyses and data mining. A very common type of 
analysis entails comparing sets of genes in terms of their functional 
annotations, for instance to identify functions that are enriched or 
depleted in particular subsets of genes (Chap. 13 [3]) or to assess 
whether particular aspects of gene function might be associated 
with other aspects of genes, such as sequence divergence or regula- 
tory networks. 

Despite conscious efforts to keep GO data as normalized as 
possible, it is heterogeneous in many respects—to a large extent 
simply because the body of knowledge underlying the GO is itself 
very heterogeneous. This can introduce considerable biases when 
the data is used in other analysis, an effect that is magnified in 
large-scale comparisons. 
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1.1 Simpson’s 
Paradox: The Perils 
of Data Aggregation 


1.2 The Inherent 
Incompleteness 

of the Gene Ontology 
(Open World 
Assumption) 


Statisticians and epidemiologists make a clear distinction 
between experimental data—data from a controlled experiment, 
designed such that the case and control groups are as identical as 
possible in all respects other than a factor of interest—and observa- 
tional data—data readily available, but with the potential presence 
of unknown or unmeasured factors that may confound the analysis.
GO annotations clearly fall into the second category. Therefore,
testing and controlling for potential confounders is of paramount 
importance. 

Before we go through some of the key biases and known 
potential confounders, let us consider Simpson's Paradox, which 
provides a stark illustration of the perils of data aggregation. 


Simpson's paradox is the counterintuitive observation that a statis- 
tical analysis of aggregated data (combining multiple individual 
datasets) can lead to dramatically different conclusions from analy- 
ses of each dataset taken individually, i.e., that the whole appears to 
disagree with the parts. Simpson's paradox is easiest to grasp 
through an example. In the classic “Berkeley gender bias case” [4],
the University of California at Berkeley was sued for gender bias
against women applicants based on the aggregate 1973 admission
figures (44% men admitted vs. 35% women), an observational
dataset. The much higher male figure appeared to be damning.
However, when the admission rates of men and women were examined
department by department, the rates were in fact similar for both
sexes (and even in favor of women in most departments). The
lower overall acceptance rate for women was not due to gender 
bias, but to the tendency of women to apply to more competitive 
departments, which have a lower admission rate in general. Thus, 
the association between gender and admission rate in the aggre- 
gate data could almost entirely be explained through strong asso- 
ciation of these two variables with a third, confounding variable, 
the department. When controlling for the confounder, the associa- 
tion between the two first variables dramatically changes. This type 
of phenomenon is referred to as Simpson's paradox. 
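
The reversal can be reproduced with a few lines of arithmetic. The following is a minimal sketch that uses invented counts for two hypothetical departments (these are not the Berkeley figures); the names `data`, `per_department_rates`, and `aggregate_rates` are ours and serve only as illustration.

```python
# Hypothetical counts, NOT the Berkeley data: (applicants, admitted)
data = {
    "dept_A": {"men": (80, 60), "women": (20, 16)},  # less competitive
    "dept_B": {"men": (20, 4), "women": (80, 20)},   # more competitive
}


def per_department_rates(data):
    """Admission rate of each group within each department."""
    return {
        dept: {group: admitted / applicants
               for group, (applicants, admitted) in groups.items()}
        for dept, groups in data.items()
    }


def aggregate_rates(data):
    """Admission rate of each group after pooling all departments."""
    totals = {}
    for groups in data.values():
        for group, (applicants, admitted) in groups.items():
            n, k = totals.get(group, (0, 0))
            totals[group] = (n + applicants, k + admitted)
    return {group: k / n for group, (n, k) in totals.items()}


by_dept = per_department_rates(data)
overall = aggregate_rates(data)
for dept, rates in by_dept.items():
    print(dept, {g: round(r, 2) for g, r in rates.items()})
print("aggregate", {g: round(r, 2) for g, r in overall.items()})
```

With these invented counts, women have the higher rate within each department, yet the lower rate once the departments are pooled, because most women applied to the more competitive department.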

Because of the inherent heterogeneity of GO data, Simpson's 
paradox can manifest itself in GO analyses. This illustrates the 
importance of recognizing and controlling for potential biases and 
confounders. 


The Gene Ontology is a representation of the current state of 
knowledge; thus, it is very dynamic. The ontology itself is con- 
stantly being improved to more accurately represent biology across 
all organisms. The ontology is augmented as new discoveries are 
made. At the same time, the creation of new annotations occurs at 
a rapid pace, aiming to keep up with published work. Despite these 
efforts, the information contained in the GO database, that is, the
ontology and the association of ontology terms with genes and
gene products, is necessarily incomplete. Thus, absence of evidence
of function does not imply absence of function.¹ This is referred to
as the Open World Assumption [5, 6]. 

Associations between genes/gene products and GO terms 
(“annotations”) are made via various methods: some manual, 
some automated based on the presence of protein domains or 
because they belong to certain protein families [7]. Annotations 
can also be transferred to orthologs by manual processes [8], or 
automatically (e.g., [9, 10], reviewed in ref. 11). There are cur- 
rently over 210 million annotations in the GO database. Despite 
these massive efforts to provide the widest possible coverage of 
gene products annotated, users should not expect each gene prod- 
uct to be annotated. 

A further challenge is that the incompleteness in the GO is 
very uneven. Interestingly, the more comprehensively annotated 
parts of the GO can also pose challenges, presenting users with 
seemingly contradictory information (see Subheading 3.2). 

The inherent incompleteness of GO creates problems in the 
evaluation of computational methods. For instance, overlooking 
the Open World Assumption can lead to inflated false positive rates 
in the assessment of gene function prediction tools [6]. However, 
there are ways of coping with this uncertainty. For instance, it is 
possible to gauge the effect of incomplete annotations on conclu- 
sions by thinning annotations [12], or analyzing successive, 
increasingly complete database releases [13, 14]. 
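
As an illustration of the thinning idea, the sketch below randomly removes a fraction of annotations and recomputes a simple statistic on what remains. It is not the procedure of [12]; the annotation set, the statistic (`term_frequency`), and the function names are assumptions made for the example.

```python
import random


def thin_annotations(annotations, keep_fraction, seed=0):
    """Return a random subset of (gene, term) pairs of the requested size."""
    rng = random.Random(seed)  # fixed seed so that the thinning is reproducible
    n_keep = round(len(annotations) * keep_fraction)
    return rng.sample(sorted(annotations), n_keep)


def term_frequency(annotations, term):
    """Fraction of annotated genes that carry the given term."""
    genes = {gene for gene, _ in annotations}
    with_term = {gene for gene, t in annotations if t == term}
    return len(with_term) / len(genes) if genes else 0.0


# Invented annotation set: 40 genes with one term, 80 genes with another,
# 20 genes carrying both.
annotations = (
    {(f"gene{i}", "GO:0016887") for i in range(0, 40)}
    | {(f"gene{i}", "GO:0003700") for i in range(20, 100)}
)

for fraction in (1.0, 0.75, 0.5, 0.25):
    thinned = thin_annotations(annotations, fraction)
    print(fraction, len(thinned), round(term_frequency(thinned, "GO:0016887"), 3))
```

If a conclusion changes markedly as the retained fraction decreases, it is likely to be sensitive to annotation incompleteness as well.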


2 Gene Ontology Structure 


One potential source of bias is that not all parts of the GO have the
same level of detail. This has strong implications for measuring the
similarity of GO annotations (Chap. 12 [2]). For instance, sister
terms (terms directly attached to a common parent term) can be 
semantically very similar or very different in different parts of the 
GO structure, which has been called the “shallow annotation 
problem” (e.g., [15, 16]). This problem can partly be mitigated by 
the use of information-theoretic measures of similarity, instead of 
merely counting the number of edges separating terms, at the 
expense of requiring a considerable number of relevant annotations 
from which the frequency of co-occurrence of terms can be esti- 
mated (more details in Chap. 13 [3]). 
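
To make the contrast with edge counting concrete, here is a minimal sketch of one widely used information-theoretic measure, Resnik similarity, in which the similarity of two terms is the information content of their most informative common ancestor. The toy ontology, the term names, and the annotation counts are invented, and each count is assumed to refer to distinct genes.

```python
import math

# Invented toy ontology: term -> set of direct parents ("is a" links only).
parents = {
    "root": set(),
    "binding": {"root"},
    "catalytic": {"root"},
    "kinase": {"catalytic"},
    "ser_thr_kinase": {"kinase"},
    "tyr_kinase": {"kinase"},
}

# Invented number of genes annotated directly to each term.
direct_counts = {
    "root": 0, "binding": 50, "catalytic": 20,
    "kinase": 10, "ser_thr_kinase": 15, "tyr_kinase": 5,
}


def ancestors(term):
    """The term itself plus every term reachable through parent links."""
    seen, stack = {term}, [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = -log p(t), where p(t) counts annotations to t or its descendants."""
    total = sum(direct_counts.values())
    cumulative = dict.fromkeys(parents, 0)
    for term, n in direct_counts.items():
        for ancestor in ancestors(term):
            cumulative[ancestor] += n
    return {t: -math.log(c / total) for t, c in cumulative.items() if c > 0}


def resnik(term1, term2, ic):
    """Information content of the most informative common ancestor."""
    return max(ic[t] for t in ancestors(term1) & ancestors(term2))


ic = information_content()
print(round(resnik("ser_thr_kinase", "tyr_kinase", ic), 3))
print(round(resnik("binding", "catalytic", ic), 3))
```

Both pairs are sister terms separated by the same number of edges, yet the two kinase terms share a specific, rarely annotated parent, whereas "binding" and "catalytic" share only the root, whose information content is zero.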


¹ Proteins whose function is uncharacterized are annotated to the root of the
ontology, which formally means “this protein is associated with some molecular
function, biological process, or cellular component, but a more specific
assertion cannot be made”. This annotation is associated with the evidence
code “No biological Data available” (ND). The absence of annotation indicates
that no curator has reviewed the literature for this gene product.

2.1 Understanding Relationships Between Ontological Concepts


The GO is structured as a graph, and one pitfall of using the GO is 
to ignore this structure. Recall that each term is linked to other 
terms via different relationships (see Chaps. 1 [1] and 3 [17] for 
introductions to ontologies and GO annotations). These relation- 
ships need to be taken into account when using GO for data 
analysis. 

Some relationships, such as “is a” and “part of”, are transitive,
which means that any protein annotated to a specific term is also
implicitly annotated to all of its parents.² For example,
“serine/threonine protein kinase activity” is a child of “protein
kinase activity” through the relationship “is a”. The transitivity of
the relation means that the annotation has the same meaning for
“serine/threonine protein kinase activity” and for all its parents:
a protein annotated to “serine/threonine protein kinase activity”
has this function, and it also has the more general function
“protein kinase activity”.

On the other hand, relations such as “regulates” are non- 
transitive. This implies that the semantics of the association of a 
gene to a GO term is not the same for its parent: if A is part of B, and 
B regulates C, we cannot make any inferences about the relationship 
between C and A. The same is true for positive and negative regula- 
tion. To illustrate, if we follow the term “peptidase inhibitor activ- 
ity” (GO:0030414) to its parents, one of the terms encountered is 
“proteolysis” via a combination of “is a”, “part of”, and “regulates” 
relations. However, a “peptidase inhibitor activity” does not mediate 
proteolysis, but quite the contrary (Fig. 1). Thus, any logical reason- 
ing on the ontology should take transitivity into account. 
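
The peptidase inhibitor example can be expressed as a small graph traversal. The sketch below encodes the edges shown in Fig. 1 and propagates an annotation upwards only along relations declared transitive; the data structure and the function name `implied_terms` are illustrative assumptions, not part of any GO tool.

```python
# Edges transcribed from the example in the text: term -> [(relation, parent)].
edges = {
    "peptidase inhibitor activity": [
        ("is_a", "negative regulation of peptidase activity"),
        ("is_a", "molecular regulator activity"),
    ],
    "negative regulation of peptidase activity": [
        ("negatively_regulates", "peptidase activity"),
    ],
    "peptidase activity": [("part_of", "proteolysis")],
}

# Relations along which a positive annotation may be propagated to parents.
TRANSITIVE = frozenset({"is_a", "part_of"})


def implied_terms(term, allowed=TRANSITIVE):
    """Terms implied by an annotation to `term`, following only allowed relations."""
    seen, stack = set(), [term]
    while stack:
        for relation, parent in edges.get(stack.pop(), []):
            if relation in allowed and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


safe = implied_terms("peptidase inhibitor activity")
naive = implied_terms(
    "peptidase inhibitor activity",
    allowed={"is_a", "part_of", "negatively_regulates"},
)
print(sorted(safe))
print(sorted(naive - safe))
```

Following every edge regardless of its type (`naive`) wrongly associates the inhibitor with “peptidase activity” and “proteolysis”, whereas the restricted traversal (`safe`) stops at the regulation term.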


[Figure 1 diagram: “peptidase inhibitor activity” is linked by “is a” to both “negative regulation of peptidase activity” and “molecular regulator activity”; “negative regulation of peptidase activity” is linked by “negatively regulates” to “peptidase activity”, which is linked by “part of” to “proteolysis”.]

Fig. 1 Example of transitive (black arrows) and non-transitive (red arrow)
relationships between classes. An annotation of a protein to the “peptidase
inhibitor activity” term does not imply that the protein has a role in
“proteolysis,” since the link is broken by the non-transitive relation
“negatively regulates”


² With the exception of “NOT” annotations, for which the transitivity applies
to children terms, not parents (see also Subheading 3.2).


The relation “has part” is the inverse of “part of”, and connects
terms in the opposite direction. Because of this, it generates cycles
in the ontology. The relation “occurs in” connects molecular function
terms to the cellular components in which they occur. Thus,
taking these relationships into account, it is possible to deduce
additional cellular component annotations from molecular function
annotations, without requiring additional experimental or
computational evidence.

It is important to know that there are three versions of the GO
ontology available: GO-basic, GO, and GO-plus.³ Only the GO-basic
file is completely acyclic. Because applications that traverse the
ontology graph usually assume that the graph is acyclic, the GO-basic
file should be used for them. The different GO ontology files are
discussed in more detail in Chap. 11 [18].


2.2 Inter-ontology Links and Their Impact on GO Enrichment Analyses

The “part of” relation, when linking terms across the different
aspects of the Gene Ontology (molecular function to biological
process, or biological process to cellular component, for instance),
triggers an annotation to the second term, using the same evidence
code and the same reference, but “GOC” as the source of the
annotation (field 15 of the annotation file; see Chap. 3 [17] for a
description of the contents of the annotation file). For example, a
DNA ligase activity annotation will automatically trigger an anno- 
tation to the biological process DNA ligation. The advantage of 
having these annotations inferred directly from the ontology is that 
it increases the annotation coverage by making annotations that 
may have been overlooked by the annotator when making the pri- 
mary annotation. However, these inter-ontology links trigger a 
large number of annotations: there are currently 12 million annota- 
tions to 7 million proteins in the GO database. Changes in the 
structure of these links (as any change in the ontology), can poten- 
tially have a large impact on the annotation set. Indeed, Huntley 
et al. [19] reported that in November 2011, there was a decrease of 
~2500 manually and automatically assigned annotations to the term 
“transcription, DNA-dependent” (GO:0006351) due to the 
removal of an inter-ontology link between this term and the 
Molecular Function term “sequence-specific DNA binding tran- 
scription factor activity” (GO:0003700). Figure 2 shows the strong 
and sudden variation in the number of annotations with term 
“ATPase activity” (GO:0016887). 

[Figure 2: line plot titled “Annotation count of GO:0016887 over time”; x-axis: Date (2004 to 2014); y-axis: Annotation Count; series: Author, Automatic, Computational, Curatorial, Experimental.]

Fig. 2 Strong and sudden variation in the number of annotations with the GO
term “ATPase activity” (GO:0016887) over time. Such changes can heavily affect
the estimation of the background distribution in enrichment analyses. To
minimize this problem, use an up-to-date version of the ontology/annotations and
ensure that conclusions drawn hold across recent releases. Data and plot
obtained from GOTrack (http://www.chibi.ubc.ca/gotrack)

Such large changes in GO annotations can affect GO enrichment
analyses, which are sensitive to the choice of background
distribution (Chap. 13 [3]; [20]). For instance, Clarke et al. [21]
have shown that changes in annotations contribute significantly to
changes in overrepresented terms in GO analysis. To mitigate this
problem, researchers should analyze their datasets using the most
up-to-date version of the ontology and annotations, and ensure
that the conclusions they draw hold across multiple recent releases.
At the time of the writing of this chapter, DAVID, a popular GO
analysis tool, had not been updated since 2009
(http://david.abcc.ncifcrf.gov/forum/viewtopic.php?f=10&t=807). Enrichment
analyses performed with it may thus identify terms whose distribu- 
tion has substantially changed irrespective of the analysis of inter- 
est. The Gene Ontology Consortium now links to the PantherDB 
GO analysis service (http://amigo.geneontology.org/rte) [22]. 
This tool uses the most current version of the ontology and the 
annotations. Regardless of the tool used, researchers should dis- 
close the ontology and annotation database releases used in their 
analyses. 
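
The dependence on the background can be illustrated with a one-sided hypergeometric test, a common (though not the only) choice for term enrichment. In this minimal sketch the study set is held fixed while the number of background genes carrying the term differs between two hypothetical annotation releases; all counts and names are invented.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term.

    k: study-set genes carrying the term     n: study-set size
    K: background genes carrying the term    N: background size
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Same study set (10 of 50 genes carry the term), two hypothetical releases
# that differ only in how many background genes carry the term.
p_old_release = hypergeom_enrichment_p(k=10, n=50, K=200, N=20000)
p_new_release = hypergeom_enrichment_p(k=10, n=50, K=1000, N=20000)
print(f"{p_old_release:.3g}  {p_new_release:.3g}")
```

The study set is identical in both calls; only the background changes, and the p-value changes with it. This is why the releases used should be reported alongside the results.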
3 Gene Ontology Annotations


3.1 Modification of Annotation Meaning by Qualifiers


3.2 Negative and Contradictory Results


Having discussed common pitfalls associated with the ontology 
structure, we now turn our attention to annotations. Understanding 
how annotations are done is essential to correctly interpreting the 
data. In particular, the information provided for each GO annotation
extends beyond the mere association of a term with a protein
(see Chap. 3 [17]). The full extent of this rich information,
which aims to reflect the biology more precisely within the GO
framework, is often overlooked.


The Gene Ontology uses three qualifiers that modify the meaning
of the association between a gene product and a Gene Ontology term:
“NOT”, “contributes to”, and “co-localizes with” (see
documentation at http://geneontology.org/page/go-qualifiers).

The “contributes to” qualifier is used to capture the molecular 
function of complexes when the activity is distributed over several 
subunits. However, in some cases the usage of the qualifier is more 
permissive, and all subunits of a complex are annotated to the same 
molecular function even if they do not make a direct contribution 
to that activity. For example, the rat G2/mitotic-specific cyclin-B1
CCNB1 is annotated as contributing to histone kinase activity,
based on data in [23], although it has only been shown to regulate
the kinase activity of CDK1. Finding a cyclin annotated as having
protein kinase activity may be counterintuitive to users who fail to
consider the “contributes to” qualifier.

The “co-localizes with” qualifier is used with two very different 
meanings: it first means that a protein is transiently or peripherally 
associated with an organelle or complex, while the second use is for 
cases where the resolution of an assay is not accurate enough to say 
that the gene product is a bona fide component member. 
Unfortunately, it is currently not possible to know which of the 
two meanings is meant in any given annotation. 
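
In practice, qualifiers are easy to overlook because they sit in a separate column of the annotation file. The sketch below separates annotations by qualifier before any further analysis. It assumes the GAF 2.0 layout (tab-separated, qualifier in column 4, GO identifier in column 5, multiple qualifiers joined by “|”, written as `NOT`, `contributes_to`, and `colocalizes_with`); the rows are invented and truncated to seven columns, and the function name is ours.

```python
# Invented GAF-style rows, truncated to the first seven columns:
# DB, object ID, symbol, qualifier, GO ID, reference, evidence code.
ROWS = [
    "DB\tG1\tgeneA\t\tGO:0009723\tREF:1\tIMP",
    "DB\tG1\tgeneA\tNOT\tGO:0009723\tREF:2\tIMP",
    "DB\tG2\tgeneB\tcontributes_to\tGO:0016887\tREF:3\tIDA",
    "DB\tG3\tgeneC\tcolocalizes_with\tGO:0030414\tREF:4\tIDA",
]


def split_by_qualifier(rows):
    """Group (gene, term) pairs by qualifier; unqualified rows go to 'plain'."""
    groups = {"plain": [], "NOT": [], "contributes_to": [], "colocalizes_with": []}
    for row in rows:
        cols = row.rstrip("\n").split("\t")
        gene, qualifier_field, term = cols[1], cols[3], cols[4]
        qualifiers = [q for q in qualifier_field.split("|") if q]
        if not qualifiers:
            groups["plain"].append((gene, term))
        for qualifier in qualifiers:
            # A row with several qualifiers is listed under each of them.
            groups.setdefault(qualifier, []).append((gene, term))
    return groups


groups = split_by_qualifier(ROWS)
for name, pairs in groups.items():
    print(name, pairs)
```

Treating all four groups as equivalent positive annotations would, among other things, count the “NOT” row for geneA as evidence for the very function it rules out.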


The “NOT” qualifier is the one with the most impact, since it 
means that there is evidence that a gene product does not have a 
certain function. The “NOT” qualifier is mostly used when a specific
function may be expected but has been shown to be missing, either
based on closer review of the protein’s primary sequence (e.g., loss 
of an active site residue) or because it cannot be experimentally 
detected using standard assays. 

The existence of negative annotations can also lead to apparent 
contradictions. For instance, protein ARR2 in Arabidopsis thaliana 
is associated with “response to ethylene” (GO:0009723) both 
positively on the basis of a paper by Hass et al. [24] and negatively 
based on a paper by Mason et al. [25]. The latter discusses this
contradiction as follows:

Hass et al. [24] reported a reduction in the ethylene sensitivity of
seedlings containing an arr2 loss-of-function mutation. By contrast,
we observed no significant difference from the wild type in the
seedling ethylene response when we tested three independent arr2
insertion mutants, including the same mutant examined by Hass et al.
[24]. This difference in results could arise from differences in growth
conditions, for, unlike Hass et al. [24], we used a medium containing
Murashige and Skoog (MS) salts and inhibitors of ethylene
biosynthesis.


Thus, in this case, the contradiction in the GO is a reflection of
the primary literature. As Mason et al. note, this is not necessarily
reflective of a mistake, as there can be differences in activity across
space (tissue, subcellular localization) and time (due to regulation),
with some of these details not fully captured in the experiment or
in its representation in the GO.

A “NOT” annotation may also be assigned to a protein that does
not have an activity typical of its homologs, for instance the
STRADA pseudokinase (UniProtKB:Q7RTN6); STRADA adopts
a closed conformation typical of active protein kinases and binds
substrates, promoting a conformational change in the substrate,
which is then phosphorylated by a “true” protein kinase, STK11
[26]. In this case, the “NOT” annotation is created to alert the
user to the fact that although the sequence suggests that the protein
has a certain activity, experimental evidence shows otherwise.

In contrast to positive annotations, “NOT” annotations
propagate to children in the ontology graph and not to parents.
To illustrate, a protein associated with a negative annotation to
“protein kinase activity” is not a tyrosine protein kinase either,
since the latter is a more specific term.


3.3 Annotation Extensions

As also described in Chap. 17 [27], the Gene Ontology has recently 
introduced a mechanism, the “annotation extensions”, by which
contextual information can be provided to increase the expressivity 
of the annotations [28]. Until recently, annotations had consisted 
of an association between a gene product and a term from one of 
the three ontologies comprising the GO. With this new knowledge 
representation model, additional information about the context of 
a GO term such as the target gene or the location of a molecular 
function may be provided. 

Common uses are to provide data regarding the location of the 
activity/process in which a protein or gene product participates. For 
example, the role of mouse opsin-4 (MGI:1353425) in the rhodopsin
mediated signaling pathway is biologically relevant in retinal
ganglion cells. Annotation extensions also allow capture of dynamic
subcellular localization, such as the S. pombe bir1 protein
(SPCC962.02c), which localizes to the spindle specifically during 
the mitotic anaphase. The annotation extensions can also be used 
to capture substrates of enzymes, which used to be outside the 
scope of GO. 
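
For readers who wish to inspect extensions directly, the following sketch parses an annotation extension field into its components. It assumes the GAF 2.0 convention for column 16, in which each extension is written as `relation(identifier)`, a comma joins extensions that apply together, and “|” separates independent alternatives. The relation names and identifiers in the example are placeholders, and `parse_extension` is a hypothetical helper.

```python
import re

# One extension: a relation name followed by a single identifier in parentheses.
_EXTENSION = re.compile(r"^(\w+)\(([^()]+)\)$")


def parse_extension(field):
    """Return a list of alternatives, each a list of (relation, target) pairs."""
    alternatives = []
    for alternative in filter(None, field.split("|")):
        parts = []
        for item in alternative.split(","):
            match = _EXTENSION.match(item.strip())
            if match is None:
                raise ValueError(f"unrecognised extension: {item!r}")
            parts.append((match.group(1), match.group(2)))
        alternatives.append(parts)
    return alternatives


# Placeholder field: one alternative with a single extension, and a second
# alternative in which two extensions apply together.
example = "occurs_in(CL:0000000)|happens_during(GO:0000000),has_input(DB:X1)"
for alternative in parse_extension(example):
    print(alternative)
```

Keeping the alternatives separate matters, since each one describes a distinct context for the same annotation rather than a single combined statement.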


The annotation extension data is available in the AmiGO [29]
and QuickGO [30] browsers, as well as in the annotation files
compliant with the GAF2.0 format
(http://geneontology.org/page/go-annotation-file-gaf-format-20).
However, because annotation extensions are relatively new, guidelines
are still being developed, and some uses are inconsistent across
different databases. Furthermore, most tools have yet to take this
information into account.

In effect, extensions of an annotation create a “virtual” GO
class that can be composed of more than one “actual” GO class,
and can be traced up through multiple parent lineages. Thus, just
as with inter-ontology links, accounting for annotation extensions
can result in a substantial inflation in the number of annotations,
which needs to be appropriately accounted for in enrichment
analyses and other statistical analyses that require precise
specification of GO term background distribution.


3.4 Biases Associated with Particular Evidence Codes

Annotations are backed by different types of experiments or analyses 
categorized according to evidence codes (Chap. 3 [17]). Different 
types of experiments provide varying degrees of precision and confi- 
dence with respect to the conclusions that can be derived from them. 
For most experiment types, it is not possible to provide a quantitative
measure of confidence. Evidence codes are informative but cannot
directly be used to exclude low-confidence data.⁴ Nonetheless,
the different evidence codes are prone to specific biases. 


Direct evidence. Taking these caveats into account, the evidence
code inferred from direct assay (abbreviated as IDA in the annotation
files) provides the most reliable evidence with respect to how
directly a protein has been implicated in a given function, as its
name implies.


Mutant phenotype evidence. Mutants are extremely useful to implicate
gene products in pathways and processes; however, exactly how
the gene product is implicated in the process/function annotated is 
difficult to assess using phenotypic data because such data are inher- 
ently derivative. Therefore, associations between gene products and 
GO terms based on mutant phenotypes (abbreviated as IMP in the 
annotation files) may be weak. The same caveat applies to annota- 
tions derived from mutations in multiple genes, indicated by evi- 
dence code “inferred from genetic interaction” (IGI). 


Physical interactions. Evidence based on physical interactions (IPI;
mostly protein-protein interactions) is comparable in confidence
to a direct assay for protein binding annotations or for cellular
components; however, for molecular functions and biological
processes, the evidence is of the type “guilt by association” and is
of low confidence. Inferences based on expression patterns (IEP)
are typically of low confidence. The presence of a protein in a
specific subcellular localization, at a specific developmental stage, or
associated with a protein or a protein complex can provide a hint
to uncover a protein’s role in the absence of other evidence, but
without more direct evidence that information is very weak.


⁴ An evidence confidence ontology has been proposed by Bastien et al. [31]
but has yet to be adopted by the GO project.


High-throughput experiments. Schnoes et al. [32] reported that 
annotations deriving from high-throughput experiments tend to 
consist of high-level GO terms, and tend to represent a limited 
number of functions. This artificially decreases the information 
content of these terms, since they are frequently annotated, and 
artificially decreased information content affects similarity analyses. 
This potentially has a large impact, since a significant fraction of the 
annotations in the GO database are derived from these types of 
analyses (as much as 25%, according to Schnoes et al., who used the
operational definition of a high throughput paper as one in which 
over 100 proteins were annotated). The GO does not currently 
record whether particular experimental annotations may be derived 
from high-throughput methods, but this may change in the future. 


Biases from automatic annotation methods. The GO association file,
containing the annotations, has information regarding the method
used to assign electronic annotations. The annotations can be
assigned by a large number of different methods. Examples include
domain functions (as assigned, for example, by InterPro), Enzyme
Commission numbers associated with an entry, BLAST, and
orthology assignment. Note that this information is not
provided as an evidence code, but as a “reference code”. The list of
methods and their associated reference codes is available at
http://www.geneontology.org/cgi-bin/references.cgi. The large number
of electronic annotations can also give them a disproportionate
impact on the results. Most analysis tools allow for the
inclusion or exclusion of electronic annotations, but not at the
more fine-grained level of the particular method. It is nevertheless
possible to use the combination of evidence code plus reference
to automatically derive a more specific evidence type (see
https://raw.githubusercontent.com/evidenceontology/evidenceontology/master/gaf-eco-mapping.txt).
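
As a sketch of such finer-grained filtering, the code below tallies annotations by the combination of evidence code and reference, and then excludes selected combinations. It assumes the GAF 2.0 layout (reference in column 6, evidence code in column 7). The rows are invented, and the `GO_REF` values are placeholders that are not meant to denote particular annotation methods.

```python
from collections import Counter

# Invented GAF-style rows, truncated to the first seven columns:
# DB, object ID, symbol, qualifier, GO ID, reference, evidence code.
ROWS = [
    "DB\tG1\tgeneA\t\tGO:0016887\tGO_REF:0000001\tIEA",
    "DB\tG2\tgeneB\t\tGO:0016887\tGO_REF:0000001\tIEA",
    "DB\tG3\tgeneC\t\tGO:0016887\tGO_REF:0000002\tIEA",
    "DB\tG4\tgeneD\t\tGO:0016887\tREF:10\tIDA",
]


def evidence_reference(row):
    """Return the (evidence code, reference) pair of one row."""
    cols = row.rstrip("\n").split("\t")
    return cols[6], cols[5]


def evidence_reference_counts(rows):
    """Number of annotations for each (evidence code, reference) pair."""
    return Counter(evidence_reference(row) for row in rows)


def exclude(rows, excluded_pairs):
    """Keep only rows whose (evidence code, reference) pair is not excluded."""
    return [row for row in rows if evidence_reference(row) not in excluded_pairs]


counts = evidence_reference_counts(ROWS)
kept = exclude(ROWS, {("IEA", "GO_REF:0000001")})
print(counts.most_common())
print(len(kept))
```

Inspecting the tally first shows which methods dominate a dataset, so that any exclusion is a deliberate choice rather than an all-or-nothing decision on electronic annotations.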

Note that a gene or gene product can have multiple annota- 
tions to the same term but with different evidence. This can pro- 
vide corroborating information on particular genes, but may also 
require appropriate normalization in statistical analyses of term 
frequency, as the frequency of terms that can be determined 
through multiple types of experiments may be artificially inflated. 
Furthermore, because different experiments can vary in their
specificity (thus resulting in annotations at different levels of
granularity for basically the same function), this redundancy only
becomes conspicuous when the transitivity of the ontology structure
is appropriately taken into account.

For more discussion on evidence codes, and their use in quality
control pipelines, refer to Chap. 18 [33].


3.5 Differences Among Species


There can be substantial differences in the nature and extent of
GO annotations across different species. For instance, zebrafish is
heavily studied in terms of developmental biology and embryogenesis
while the rat is the standard model for toxicology. These
differences are reflected in the frequency of GO terms, which can
vary considerably across species [34]. This has
important implications for enrichment analyses and other statistical
analyses requiring a background distribution of GO annotations.
For instance, consider an experiment trying to establish the
biological processes associated with a particular zebrafish protein
by identifying its interaction partners and performing an enrichment
analysis on them. If we naively use the entire database as
background, the interaction partners might appear to be enriched
in developmental genes simply because this class is over-represented
in general in zebrafish. Instead, one should use zebrafish
gene-related annotations only as background [20].


3.6 Authorship Bias

Other biases are less obvious but can nevertheless be strong and 
thus have a high potential to mislead. Recently, sets of annotations 
derived from the same scientific article were shown to be on aver- 
age much more similar than annotations derived from different 
papers (Fig. 3; [34]). For instance, Nehrt et al. compared the
functional similarity of orthologs (genes related through speciation)
across different species and paralogs (genes related through
duplication) within the same species, and observed a much higher level
of functional conservation among the latter [36]. However, this
difference was almost entirely due to the fact that the GO functional
annotations of same-species paralogs are ~50 times more
likely to be derived from the same paper than those of orthologs; when
controlling for authorship and other biases, the difference in
functional similarity between same-species paralogs and orthologs
vanished and even became in favor of orthologs [34].

Note that the difference is smaller but remains significant if we
compare annotations established from different papers, but with at
least one author in common, with annotations from different
articles with no author in common.

[Figure 3: two panels showing average GO similarity (axis from 0.0 to 0.6). (a) “Authorship bias”: same paper; different paper, common author; different authors. (b) “Propagated annotation bias”: experimental; curated; uncurated.]

Fig. 3 (a) Average GO annotation similarity (using the measure of Schlicker et al. [35]) between homologous
genes, considering experimental annotations partitioned according to their provenance; (b) Average GO
annotation similarity between homologous genes, partitioned according to their GO annotation evidence tags
(Experimental: evidence code EXP and subcategories; Uncurated: evidence code IEA; Curated: all other evidence
codes). Figure adapted from ref. 34


3.7 Annotator Bias


Just as systematic differences among investigators can lead to the
authorship bias, systematic differences in the way GO curators capture
this information can lead to annotator bias. These annotator biases
can in part be attributed to different annotation focus, but also to
different interpretation or application of the GO annotation guidelines
(http://geneontology.org/page/go-annotation-policies).

UniProt provides annotations for all species, which allows us
to assess the effect of annotator (or database) bias. If we compare
UniProt annotations for mouse proteins with those done by the
Mouse Genome Informatics group (MGI), we see that comparable
fractions of proteins are annotated using the different experimental
evidence codes, with mutant phenotypes being the most widely
used (78% of experimental annotations in MGI, versus 63% in
UniProt), followed by direct assays (20% of annotations in MGI
and 32% in UniProt).

However, when we look at which GO terms are annotated based
on phenotypes (IMP and IGI) by the two groups, we notice a large
difference in the terms annotated. The top term annotated by MGI
supported by the IMP evidence code is “in utero embryonic
development”, with 1170 annotations to 1020 proteins. UniProt has
only 4 annotations for this term. On the other hand, UniProt has as
one of its top-annotated classes “regulation of circadian rhythm”,
with 49 annotations to 38 proteins; 96 annotations for 69 proteins if
we also include annotations to more specific, descendant terms.
MGI, on the other hand, has only 18 annotations for 19 proteins.
This indicates that the annotations provided by different groups are
biased towards specific aspects, and are not a uniform representation
of the biology of all gene products in a species.


3.8 Propagation Bias

Another strong and perhaps surprising bias lies in the very different
average GO similarity among electronic annotations compared
with that among experimental annotations. Indeed, if we consider
homologous genes, their similarity in terms of electronic annotations
tends to be much higher than in terms of experimental annotations,
with curated annotations lying in between ([34]; Fig. 3). A likely
explanation for this phenomenon is that electronic annotations are
typically obtained by inferring annotations among homologous
sequences, a process that can only increase the average functional
similarity of homologs.

Because of this homology inference bias, one must exercise
caution when drawing conclusions from sets of genes whose
annotations might have different proportions of experimental vs.
electronic annotations. For instance, this would be the case when
comparing annotations from model organisms with those from
non-model organisms (the latter being likely to consist mostly of
electronic annotations obtained through propagation).

More subtly, because function conservation is generally
believed to correlate with sequence similarity, many computational
methods preferentially infer function among phylogenetically close
homologs. This bias can thus confound analyses attempting to
gauge the conservation of gene function across different levels of
species divergence.


3.9 Imbalance Between Positive and Negative Annotations

As discussed above, both our knowledge of gene function and its 
representation in the GO remain very incomplete. We have already 
discussed the pitfalls of ignoring this fact altogether (closed vs. 
open world assumption), or assuming similar term frequencies 
across species. But the extent of missing data varies along other 
dimensions as well: for example it can depend on how easy it is to 
experimentally establish a particular function and how interesting 
the potential function might be. The problem is particularly acute 
in the case of negative annotations, because they can be even more 
difficult to establish than their positive counterparts (e.g., a nega- 
tive result can also be due to inadequate experimental conditions, 
differences in spatiotemporal regulation, etc.) and they are often 
perceived as being less useful, and certainly less publishable. As a 
result, currently less than 1% of all experimental annotations are 
negative ones in UniProt-GOA [37]. This imbalance causes prob- 
lems with training of machine learning algorithms [38]. Rider 
et al. [39] investigated the reliability of typical machine learning 
evaluation metrics (area under the “receiver operating characteris- 
tic” (ROC) curve, area under the precision-recall curve) under 
different levels of missing negative annotations and concluded 
that this bias could strongly affect the ranking obtained from the 
different metrics. Though this particular study adopted a closed 
world assumption, the effect of a varying proportion of negative 
annotations is likely to be even greater under the open world 
assumption. 
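
To see why such an imbalance can make a single summary metric misleading, consider a predictor that labels every gene-term pair as positive. The sketch below is a toy illustration under assumed numbers (99 positives, 1 negative), not a reproduction of the analysis in [39]; the function name is hypothetical.

```python
def confusion_rates(y_true, y_pred):
    """Return accuracy, false-positive rate, and false-negative rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Rates are undefined (NaN) when the corresponding class is absent.
        "fpr": fp / (fp + tn) if (fp + tn) else float("nan"),
        "fnr": fn / (fn + tp) if (fn + tp) else float("nan"),
    }


# Assumed toy label set: 99 positive annotations and a single negative one.
y_true = [True] * 99 + [False]
# A trivial predictor that always answers "positive".
always_positive = [True] * 100

rates = confusion_rates(y_true, always_positive)
print(rates)
```

The trivial predictor reaches high accuracy while misclassifying every negative example, which is why it helps to report false-positive and false-negative rates separately.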
4 Getting Help 


This chapter provides a broad overview of some of the pitfalls associ- 
ated with GO-based analysis. Table 1 summarizes the most impor- 
tant pitfalls users encounter using GO. 

Users are advised to make use of a number of excellent 
resources provided by the GO consortium: 

• The GO website http://geneontology.org 
• The GO FAQ http://geneontology.org/faq-page 
• The GO team are eager to help with your problems: e-mail go-help@geneontology.org 
• The wider bioinformatics community can be consulted via sites like Biostars; see the GO tag https://www.biostars.org/t/go/ 
• The GO community can be contacted on Twitter at @news4go 


Table 1 
Main pitfalls or biases discussed in the chapter and their remedies 

Pitfall or bias: Wrongly assuming that the absence of an annotation implies the absence of function. 
Remedy: Account for the fact that both ontology and annotations are necessarily incomplete, for instance by assessing the impact of incompleteness on one’s analyses and findings. 

Pitfall or bias: Not all directed edges in the ontology structure have the same meaning: depending on their type, the relationship they represent may or may not be transitive. 
Remedy: The transitivity of each type of relation must be taken into account when reasoning over the GO. “Is a” and “part of” are transitive, but “regulates” is not. 

Pitfall or bias: To yield meaningful results, GO enrichment analyses require accurate specification of the background distribution, which can vary substantially across releases, species, etc. 
Remedy: Specify the actual background distribution used in the analysis of interest. Short of this, ensure that the enrichment analysis is performed on a consistent database release and consistent subsets of species, terms, etc. To test the robustness of results, consider repeating the analysis using several releases of GO ontology/annotation databases. Avoid tools that are not regularly updated. 

Pitfall or bias: Inter-ontology links and annotation extensions can result in large variations in the number of annotations. Furthermore, annotation extensions may not be consistently implemented, if at all, across analysis tools or workflows. 
Remedy: Keep track of database releases in analyses. If they are relevant, make sure that annotation extensions are implemented consistently. 

Pitfall or bias: Qualifiers such as “NOT” or “co-localizes with” are important parts of a gene annotation in that they fundamentally change the meaning of annotations. Because only a small minority of all annotations have qualifiers, such errors can easily go unnoticed. 
Remedy: Remember to take into account qualifiers. When using tools or software libraries, make sure that these take qualifiers into account as well. 

Pitfall or bias: Annotations are supported by different types of evidence (categorized by evidence codes). The annotations associated with each code vary in their scope, specificity, and number. These differences can confound some analyses. 
Remedy: Take evidence codes into account. In statistical analyses, consider the distribution of annotations in terms of evidence codes, and, if needed, control for this potential confounder. 

Pitfall or bias: Different species tend to have very different types of annotations. For instance, model species have many more experiment-based annotations. 
Remedy: When performing statistical analyses or using information-theoretic similarity measures, use species-specific GO term frequencies. 

Pitfall or bias: Experiment-based annotations derived from the same research article tend to be more similar than annotations derived from different articles. Similar trends hold for annotations derived from same versus different authors, and same versus different annotators. 
Remedy: Control for authorship bias in analyses that may have varying proportions of annotations stemming from the same article, lab, or annotation team. 

Pitfall or bias: Because annotations are preferentially propagated among closely related sequences, electronic annotations can confound analyses seeking to characterize relationships between evolution and function. 
Remedy: Restrict such analyses to experiment-based annotations. Avoid circularity. 

Pitfall or bias: There are many more positive annotations than negative annotations. As a result, standard accuracy measures used by machine learning methods may be misleading (“class imbalance problem”). 
Remedy: Consider false-positive and false-negative rates separately. Focus on a subset of data for which the class imbalance problem is less pronounced. 


5 Conclusion 


This chapter surveys some of the main pitfalls and biases of the 
Gene Ontology. The number of potential issues, summarized in 
Table 1, may seem daunting. Indeed, as discussed at the start of this 
chapter, there are some inherent risks in working with observational 
data. However, simple remedies are available for many of these 
(Table 1). By understanding the subtleties of the GO, controlling 
for known confounders, trying to identify unknown ones, and 
cautiously proceeding forward, users can make the most of the 
formidable resource that is the GO. 
Visualizing GO Annotations 


Fran Supek and Nives Skunca 


Abstract 


Contemporary techniques in biology produce readouts for large numbers of genes simultaneously, the 
typical example being differential gene expression measurements. Moreover, those genes are often richly 
annotated using GO terms that describe gene function and that can be used to summarize the results of 
the genome-scale experiments. However, making sense of such GO enrichment analyses may be challeng- 
ing. For instance, overrepresented GO functions in a set of differentially expressed genes are typically 
output as a flat list, a format not adequate to capture the complexities of the hierarchical structure of the 
GO annotation labels. 

In this chapter, we survey various methods to visualize large, difficult-to-interpret lists of GO terms. 
We catalog their availability—Web-based or standalone, the main principles they employ in summarizing 
large lists of GO terms, and the visualization styles they support. These brief commentaries on each soft- 
ware are intended as a helpful inventory, rather than comprehensive descriptions of the underlying algo- 
rithms. Instead, we show examples of their use and suggest that the choice of an appropriate visualization 
tool may be crucial to the utility of GO in biological discovery. 


Key words Gene Ontology, Visualization, Interpretation, Redundancy, Enrichment, Tools 


1 Introduction 


We have entered the era of massive data sets in biology. A variety 
of experimental and computational techniques can produce read- 
outs for many genes—or whole genomes—simultaneously. 
Moreover, we can also assign rich functional annotations to most 
of the genes of interest. Such a wealth of data is accompanied by 
challenges in interpretation. 

In this chapter, we focus on methods that visualize long lists of 
Gene Ontology (GO) terms [1]. The methods we survey take as 
input a flat list of GO terms, often accompanied by some user- 
supplied measure of statistical significance or importance. 
Visualization methods summarize such lists to distil the most rele- 
vant information. Finally, these methods produce various styles of 
visualization that can aid interpretation. 


Christophe Dessimoz and Nives Skunca (eds.), The Gene Ontology Handbook, Methods in Molecular Biology, vol. 1446, 
First, we examine the challenges related to understanding large 
lists of GO terms; second, we provide a systematic overview of the 
published methods that address these challenges; third, we discuss 
different visualization styles these methods use; and fourth, we 
give usage examples for a selection of these tools. 


2 Understanding Large Lists of Genes and Their Gene Ontology Labels 


2.1 Challenges in Interpreting Lists of Enriched GO Terms 


A classical example of a large biological dataset is gene expression 
measurements by RNA-Seq, which monitor the genome-wide 
changes in transcriptional regulation between experimental condi- 
tions. Typically, tens or hundreds of genes will be upregulated or 
downregulated in response to a particular treatment. This indicates 
that a systems-level change in the experimental model has occurred, 
which may be described by examining the common properties of 
the genes whose expression was altered. Do these genes participate 
in the same metabolic or signaling pathways? Do they perform 
similar biochemical functions? Do their protein products co- 
localize in the cell? Formally, such sets of genes are subjected to 
statistical tests for enrichment for various functional categories [2]. 
The gene functions tested are typically described by Gene Ontology 
(GO) terms [3], although alternatives such as KEGG Pathways or 
CORUM protein complexes can be used. 

Of note, such GO enrichment analyses are by no means restricted 
to experiments measuring changes in gene expression, nor to experi- 
mental data in general. Any list of genes for which interpretation is 
sought can be described using enriched GO terms and it could, for 
instance, derive from comparative genomics. In particular, one could 
perform an evolutionary analysis to look at biological roles of gene 
families that have expanded in a certain eukaryotic lineage, e.g., [4]. 
Similarly, a researcher may wish to describe the overall functional 
repertoire in a newly sequenced genome, while comparing to exist- 
ing genomes of related organisms. 


As Chap. 3 [5] describes, the GO is a hierarchical structure, wherein 
the individual terms can have not only multiple descendants, but 
also multiple parents; more formally, GO is a directed graph; the 
basic version of the GO is also a directed acyclic graph (Chap. 3 
[5]; Fig. 1; see footnote 1). This complex structure, along with its large size (the 
GO has thousands of nodes), makes it challenging to display the 
part(s) of the GO of interest. For instance, a list of GO terms found 
to be enriched in a gene expression experiment could be concen- 
trated in one part of the GO graph. 
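
The consequence of this graph structure for annotations can be sketched in a few lines: a gene annotated with a term is implicitly annotated with every ancestor of that term. The example below uses a small hypothetical graph with made-up term labels rather than real GO identifiers, and it assumes that all edges are of a type over which annotations propagate (such as “is_a” and “part_of”). The function names are illustrative only.

```python
# Hypothetical toy ontology: each term maps to its direct parents.
# Term "C" has two parents, so this is a DAG rather than a tree.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],
    "D": ["C"],
}


def ancestors(term, parents):
    """Return all ancestors of `term` by walking up the parent edges."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagate(direct_annotations, parents):
    """Extend each gene's direct terms with all of their ancestors."""
    full = {}
    for gene, terms in direct_annotations.items():
        closure = set(terms)
        for term in terms:
            closure |= ancestors(term, parents)
        full[gene] = closure
    return full


direct = {"gene1": {"D"}, "gene2": {"A"}}
propagated = propagate(direct, PARENTS)
print(sorted(propagated["gene1"]))
print(sorted(propagated["gene2"]))
```

Because a single specific annotation expands to several more general ones, enriched term lists naturally contain many related terms, which is part of what makes them hard to read as flat lists.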

A further complication is that such lists of interesting GO terms 
tend to be large, meaning that many different biological processes or 
molecular functions may appear to be affected in the experiment. 


1 http://geneontology.org/page/download-ontology 
Fig. 1 A subset of the Gene Ontology Directed Acyclic Graph (DAG) for the GO 
term “vesicle fusion” (GO:0006906). The GO is a DAG: terms are nodes, while the 
relations are edges. Two main relation types between terms are “is_a” and 
“part_of.” More specific terms are found deeper in the graph. Thus, if a gene 
product is annotated with a GO term, it is by definition also annotated with all the 
parent terms of that GO term 




One reason for this is that the GO itself is designed and developed to 
describe nuances in gene function as exhaustively as possible; conse- 
quently, many of the GO terms will be partially redundant. For instance, 
many of the genes participating in “translation” (GO:0006412) are 
also structurally a part of “ribosome” (GO:0005840). 

In addition to the inherent redundancy of the GO, responses of 
biological systems to experimental perturbation often genuinely 
involve coordinated activity of many related and/or overlapping 
subsystems. For example, replicating cells facing DNA damage may 
upregulate “nucleotide-excision repair” (GO:0006289) to help fix 
the lesions, while at the same time resorting to “error-prone transle- 
sion synthesis” (GO:0042276) to ensure DNA replication finishes. 


GO term enrichment analyses often result in lists of significant 
GO terms that are both long and redundant, hampering interpre- 
tation. Various methods to visualize such lists may help investiga- 
tors spot dominant trends in the data, leading to novel biological 
insight. Such visualizations mostly operate by different ways of 
grouping and displaying similar GO terms together, wherein the 
structure of the GO defines what is similar and what is not (see 
semantic similarity analysis below). In its simplest form, this 
involves displaying a part of the GO hierarchy with the GO terms 
of interest highlighted and their parent-child relationships 
shown. Displaying also the user-supplied experimental data may 
help prioritize which GO terms, among many similar ones, are of 
higher interest. 

We suggest that having an unbiased way to algorithmically 
organize GO terms derived from experimental data helps prevent 
unintentional biases in interpretation. If unaware of the overall 
semantic structure in the set of significant GO terms, the investigator 
may pick one or two GO terms in the list that “make sense,” in 
terms of fitting with their expectations. Visualizing the interre- 
lationships between the GO terms alongside the statistical support 
for each in the experimental data could help avoid focusing on 
outlying—and perhaps spurious—results. In addition, one could 
be made aware of the common pitfall where one GO term is cho- 
sen, while other similarly statistically supported terms are ignored. 
Finally, and very importantly, a good visualization is also an effec- 
tive means of presenting summaries of scientific results, whether in 
papers, presentations or posters. 


3 Overview of the GO Visualization-Related Tools 


Here we systematize and describe the currently available tools for 
visualizing sets of GO annotations. Additionally, we highlight three 
of these tools in more detail. The tools and the underlying meth- 
ods they implement can be classified as follows: 


1. Interactive GO browsers. Tools for interactively browsing the 
entire GO and also the genes known to be annotated with 
chosen GO terms. Importantly, these do not take into account 
a user-supplied set of annotations of interest, e.g., derived from 
an enrichment analysis of experimental data. Visualization is 
typically not emphasized and not configurable. See AmiGO 
[6] and QuickGO [7]. Of note, OLSVis [8] can display other 
biomedical ontologies in addition to the GO. 


2. Network visualization tools. These are not particular to the 
GO, but can display any kind of graph, including the GO or a 
part thereof. The visualization options are highly configurable; 
however, since these tools were not designed specifically for 
GO, they tend to be more complicated to use. See Cytoscape 
[9], Gephi [10], and Pajek [11]. 

(a) Of note, there are Cytoscape plugins specialized for han- 
dling groups of GO terms: EnrichmentMap [12] and 
BINGO [13]. 


3. GO visual overlays. Tools that can visualize an interesting sub- 
set of the GO, and display some additional data about each 
shown GO term. Typically, this involves coloring the GO terms 
by the enrichments or p-values determined from user-supplied 
gene lists (these tools tend to also perform the GO enrichment 
analysis). They display the terms arranged by parent-child rela- 
tionships, in a tree-like visual layout. Examples include GOrilla 
[14], GRYFUN [15], GOFFA [16], and SimCT [17]. 

(a) In addition to the GO, similar tools are available which can 
highlight the individual members in displayed KEGG 
pathways [18]; the pathways can also be shown in a KEGG 
BRITE functional hierarchy with FuncTree [19]. 

4. Semantic similarity analysis. Tools that examine the semantic 
similarity (redundancy) between various GO terms, including 
those that are not linked by direct parent-child relationships. 
The similarities are used to organize a set of interesting GO 
terms into clusters and/or graphs, while simultaneously allow- 
ing highly redundant terms to be filtered out. The user can 
supply enrichments or p-values to prioritize results. 
Implemented in REVIGO [20] and RedundancyMiner [21]. 


(a) Some provisions for this are made in g:Profiler [22], which 
collapses similar GO terms. 


(b) The Ontologizer [23] can perform a statistical test for 
enrichment that accounts for the parent-child redundancy 
[24] prior to visualizing results. 


5. Emerging methods. These may involve display of the trends 
underlying a group of GO terms in a so-called “tag cloud” 
(with text in various colors and sizes), or in a tree map (a hier- 
archical organization of colored tiles), as in REVIGO [20] or 
GOSummaries [25]. Additionally, several tools now support 
the display of multiple GO enrichment analyses side-by-side; 
see BACA [26] or GOSummaries. SimCT [17] can display 
subtrees of other biomedical ontologies in addition to the GO. 


4 Case Studies with Selected Tools 


GOrilla [14] is a Web-based tool that can take two types of input: 
either a ranked list of genes or two lists, one with the target genes and 
the other with the background genes. As output, GOrilla produces a 
visualization that indicates which terms are significantly enriched. 

We focus here on the enrichment analysis that takes a ranked 
list of genes. Briefly, the null hypothesis is that the occurrences of 
a GO term at various points in the ranked list are equiprobable. 
Lower p-values indicate a higher confidence for a GO term to be 
enriched towards the top of the list. 
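
The idea of scoring enrichment towards the top of a ranked list can be sketched with a minimum-hypergeometric style statistic: for each cutoff in the ranking, compute the hypergeometric tail probability of seeing at least the observed number of annotated genes above the cutoff, and keep the smallest value. This is a simplified illustration under our own assumptions, not GOrilla's implementation; the function names and the toy ranking are hypothetical. Note that the minimum over many cutoffs is a score rather than a p-value and would still need correcting for the multiple cutoffs tested.

```python
from math import comb  # requires Python 3.8 or later


def hypergeom_tail(N, B, n, b):
    """P(X >= b) for X ~ Hypergeometric(N genes, B annotated, n drawn)."""
    upper = min(n, B)
    favorable = sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1))
    return favorable / comb(N, n)


def min_hypergeom_score(ranked_flags):
    """Smallest hypergeometric tail over all cutoffs of a ranked list.

    `ranked_flags[i]` is True if the gene at rank i carries the GO term.
    Returns (score, cutoff) where `cutoff` is the number of top genes used.
    """
    N = len(ranked_flags)
    B = sum(ranked_flags)
    best_score, best_cutoff = 1.0, 0
    annotated_so_far = 0
    for n, flag in enumerate(ranked_flags, start=1):
        annotated_so_far += flag
        score = hypergeom_tail(N, B, n, annotated_so_far)
        if score < best_score:
            best_score, best_cutoff = score, n
    return best_score, best_cutoff


# Toy ranking of 10 genes in which the 3 annotated genes are all at the top.
flags = [True, True, True] + [False] * 7
score, cutoff = min_hypergeom_score(flags)
print(score, cutoff)
```

In this toy case the best cutoff falls right after the third gene, where all annotated genes have been seen and no unannotated ones.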

As an example analysis, we downloaded a dataset of transcrip- 
tion profiling by microarray of human peripheral blood 
mononuclear cells after a treatment with Staphylococcus aureus and 
incubation for different lengths of time [27], obtained from the 
Gene Expression Atlas [28] at http://www.ebi.ac.uk/gxa/experiments/E-GEOD-16837. 
In the GOrilla Web interface (http://cbl-gorilla.cs.technion.ac.il/), 
we set the p-value threshold to 10⁻⁵, and the remaining settings 
were the defaults in the tool. 

The display is shown in Fig. 2. Based on the color of the boxes, 
the user can see which GO terms are enriched, and the con- 
necting lines describe their relationship to other terms in the GO 
graph. 

REVIGO [20] analyzes large lists of significant GO terms and 
removes the redundant terms, in order to further narrow the search 
to a set of nonredundant and highly significant GO terms. Briefly, 
REVIGO creates clusters of GO terms that are semantically similar, 
and selects one representative for each cluster. 
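
The general principle of picking representatives among semantically similar terms can be illustrated with a simple greedy filter: visit terms from most to least significant, and keep a term only if it is not too similar to any term already kept. This is a deliberately simplified stand-in and should not be read as REVIGO's actual procedure; the term labels, the similarity values, and the 0.7 cutoff are all assumptions chosen for illustration.

```python
def reduce_redundancy(p_values, similarity, threshold=0.7):
    """Greedy filter keeping the most significant non-redundant terms.

    `p_values` maps term -> p-value; `similarity(a, b)` returns a value
    in [0, 1]. A term is dropped if its similarity to any kept term
    reaches `threshold`.
    """
    kept = []
    for term in sorted(p_values, key=p_values.get):
        if all(similarity(term, other) < threshold for other in kept):
            kept.append(term)
    return kept


# Hypothetical pairwise similarities; unlisted pairs default to 0.1.
PAIR_SIMILARITY = {
    frozenset(("T1", "T2")): 0.9,
    frozenset(("T3", "T4")): 0.8,
}


def toy_similarity(a, b):
    return PAIR_SIMILARITY.get(frozenset((a, b)), 0.1)


toy_p_values = {"T1": 1e-8, "T2": 1e-6, "T3": 1e-5, "T4": 1e-3}
representatives = reduce_redundancy(toy_p_values, toy_similarity)
print(representatives)
```

Here T2 and T4 are dropped because each closely resembles a more significant term that was already kept.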
Fig. 2 A visualization of the Biological Process Gene Ontology annotations using GOrilla. The dataset used is a 
microarray transcription profiling of human peripheral blood mononuclear cells after treatment with 
Staphylococcus aureus (Expression Atlas dataset ID E-GEOD-16837). The GOrilla settings were left at default 
values: p-value threshold of p< 10-°, organism Homo sapiens and running mode "single ranked list" 


One possible input for REVIGO is a list of GO terms with the 
associated p-values, such as the output list from GOrilla. 
Alternatively, REVIGO can take as input any other list of GO terms, 
with or without associated numerical values, and provide various 
styles of visualization. First, a scatterplot that distributes the GO 
terms, represented as bubbles, in a 2D space that will put two GO 
terms closer together if they are more semantically similar. Second, 
an interactive graph that connects the user-supplied set of GO terms 
based on the structure of the GO hierarchy. Third, a TreeMap 
where terms are clustered and clusters displayed as colored tiles. 
Fourth, REVIGO provides a word cloud that highlights the most 
frequent keywords in the names and descriptions of the GO terms. 

To perform the analysis, we used the setting in GOrilla to 
automatically forward its GO term enrichment results as a query 
to the REVIGO tool. In REVIGO, we used the default settings. 


Fig. 3 Visualizations of Biological Process GO annotations using REVIGO: scatterplot and table views. The data- 
set used was imported from GOrilla (see legend of Fig. 2). We used the default settings of the REVIGO tool. 
(a) The scatterplot view visualizes the GO terms in a “semantic space” where the more similar terms are 
positioned closer together [20]. The color of the bubble reflects the p-value obtained in the GOrilla analysis, 
while its size reflects the generality of the GO term in the UniProt-GOA database. (b) The 
table view shows the list of all the input GO terms: those shown in the scatterplot are written in regular font, 
while those labeled as redundant by REVIGO are shown in gray italics 
Fig. 4 Visualizations of Biological Process GO annotations using REVIGO: TreeMap (a), interactive graph (b) and 
word cloud views (c). The dataset used was imported from GOrilla (see legend of Fig. 2). We used the default 
settings of the tool 
The results are shown in Figs. 3 and 4. The various visualization 
styles highlight the GO terms that are enriched in the input 
dataset. 

RedundancyMiner [21] is another tool that focuses on non- 
redundant terms in a large list of enriched GO terms, producing a 
Clustered Image Map (CIM) as a result. It is a part of a larger 
pipeline: RedundancyMiner relies on GOminer input and on CIMminer 
for visualization. In particular, RedundancyMiner performs 
Fisher’s exact tests for each pair of GO terms in the datasets, test- 
ing whether the two sets of genes annotated with these GO 
terms overlap. A symmetrical matrix of these p-values is 
subsequently analyzed to arrive at a set of GO terms that are most 
independent, and therefore least redundant. 
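
A pairwise overlap test of this kind can be sketched as follows. The example computes a one-sided Fisher's exact test (equivalently, a hypergeometric tail) for the overlap between the gene sets of two GO terms, and fills a symmetric matrix with the resulting p-values. It is an illustration under stated assumptions, not RedundancyMiner's code: the choice of a one-sided test, the function names, the term labels, and the toy gene sets are all hypothetical.

```python
from itertools import combinations
from math import comb  # requires Python 3.8 or later


def overlap_p_value(genes_a, genes_b, universe_size):
    """One-sided Fisher's exact test for excess overlap of two gene sets.

    Returns the probability of an overlap at least as large as observed
    if `genes_b` were drawn at random from `universe_size` genes.
    """
    a, b = len(genes_a), len(genes_b)
    observed = len(genes_a & genes_b)
    upper = min(a, b)
    favorable = sum(
        comb(a, i) * comb(universe_size - a, b - i)
        for i in range(observed, upper + 1)
    )
    return favorable / comb(universe_size, b)


def pairwise_matrix(term_to_genes, universe_size):
    """Symmetric matrix (as a dict) of overlap p-values for all term pairs."""
    matrix = {}
    for t1, t2 in combinations(sorted(term_to_genes), 2):
        p = overlap_p_value(term_to_genes[t1], term_to_genes[t2], universe_size)
        matrix[(t1, t2)] = matrix[(t2, t1)] = p
    return matrix


# Hypothetical gene sets for three terms in a universe of 20 genes.
term_to_genes = {
    "X": {"g1", "g2", "g3", "g4", "g5"},
    "Y": {"g1", "g2", "g3", "g4", "g6"},
    "Z": {"g10", "g11", "g12", "g13", "g14"},
}
matrix = pairwise_matrix(term_to_genes, universe_size=20)
print(matrix[("X", "Y")], matrix[("X", "Z")])
```

A small p-value, as for X and Y here, signals two terms that annotate largely the same genes and are therefore candidates for being treated as redundant.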

To perform the analysis, we started with the same file as 
for the two tools described above. First, we generated two files 
using a custom Python script: (1) a file containing all the genes 
in the array and (2) a file containing the genes that are over- or 
underexpressed, labeled with “1” or “-1,” respectively. Of note, 
Python is not necessary for RedundancyMiner and these files 
could be generated otherwise. Second, we supplied these files as input 
to the GOminer tool (http://discover.nci.nih.gov/gominer/GoCommandWebInterface.jsp). 
We selected the databases that 
contain Homo sapiens data and as the organism we set H. sapi- 
ens. The remaining parameters were the defaults in the tool. 
Third, we used the resulting folder as the working folder for 
RedundancyMiner and we ran the analysis in default mode. Finally, 
we visualized the resulting CIM file using cimMiner [29], avail- 
able at http://discover.nci.nih.gov/cimminer/home.do, in single 
matrix mode. 

Results of our example analysis are shown in Fig. 5. Even with 
the stringent threshold of requiring a log2 fold change greater 
than 5, trends in significant GO terms similar to those shown by 
the other two tools are visible. 


5 Choice of Visualization 


Above, we have outlined some of the currently available software 
tools that can visualize a set of GO terms. We have also argued that 
a good visualization is an effective means of discovering underlying 
trends in the data in an unbiased fashion; an appropriate visual dis- 
play is also imperative when communicating the results to others. 
The question of which software tool to apply should be addressed 
keeping these goals in mind. A related yet distinct question is which 
specific visualization method to choose. Here, we give a summary 
of the available options. Of note, the authors of this text are also the 
developers of REVIGO [20], a versatile visualization tool, which 
implements several of the approaches listed below. 


[Fig. 5 image: clustered image map (Euclidean distance). Row labels: GO:0061041 regulation of wound healing; GO:0009605 response to external stimulus; GO:0006935 chemotaxis / GO:0042330 taxis; GO:0040011 locomotion; GO:0042221 response to chemical stimulus; GO:0009611 response to wounding; GO:0006954 inflammatory response; GO:0002376 immune system process] 


Fig. 5 A visualization of a set of Biological Process GO annotations using RedundancyMiner. The dataset used 
is a microarray transcription profiling of human peripheral blood mononuclear cells after treatment with 
Staphylococcus aureus. For this visualization, we focus on genes that had log2 fold change greater than 5 


1. Graphs/networks. The GO graph consists of nodes (here, Gene 
Ontology terms) and edges (here, parent-child relation- 
ships), which connect the nodes and which have directional- 
ity. Nodes and edges can have multiple attributes that can be 
visualized. For instance, the enrichment of a GO term in a 
user’s experiment may be shown as a color of a node (Fig. 2). 
Importantly, the spatial arrangement of the nodes on the 
final plot is called a layout, and is often created to suggest 
related clusters of nodes by placing similar nodes closer 
together. Such approaches are reviewed and demonstrated 
by Merico et al. [30]; tools like Cytoscape [9] support a vari- 
ety of visual layouts. 


(a) A special case of a layout is a tree-like display that high- 
lights the ‘levels’ in the Gene Ontology and the parent- 
child relationships between terms (e.g., Fig. 2). These 
levels (determining the depth of a node in the graph) are 
often used as a measure for how general the GO term is. 
However, this may be misleading in some instances (for 
example, the Molecular Function ontology is shallower 
than the Biological Process ontology), and we there- 
fore recommend the use of the information content (IC) 
measure [31] for this purpose. This is defined as the negative 
logarithm of the relative frequency of annotations to the 
respective term in some underlying database, such as 
UniProt-GOA [32]. 

2. Semantic similarity space. Various mathematical methods mea- 
sure the semantic similarity between pairs of GO terms, such as 
SimRel [33]; see ref. 34 for a review. If the first term in a pair 
is a direct parent, child or sibling of the second term, their 
semantic similarity will be very high. However, more distantly 
related terms will also show some degree of similarity, as 
long as they reside in a common branch of the GO tree struc- 
ture. Many such pairwise similarities within a group of GO 
terms can be processed by a projection technique, such as prin- 
cipal components analysis (PCA) or multidimensional scaling. 
The resulting plots preserve as much of the original pairwise 
distances as possible, while showing all supplied GO terms in a 
two-dimensional plane. The main visualization in REVIGO is 
based on this approach (Fig. 3a). 


3. Treemaps. Hierarchical diagrams consisting of tiles subdivided 
into smaller tiles. Treemaps are good for interactive explora- 
tion, as they can be ‘zoomed in’ by clicking a tile and revealing 
finer levels of subdivisions. Here, tiles can be GO terms and 
the subdivisions their child terms. The tile sizes may corre- 
spond to some measure of importance of GO terms to the 
user, such as enrichment or p-values. REVIGO has an imple- 
mentation of this visualization approach (Fig. 4a). 


4. Word clouds. A display with text shown in various sizes and pos- 
sibly colors. Here, the individual words or short phrases may be 
the names of the GO terms or some keywords associated with the 
GO terms. The text size/color may convey the importance to 
the user (enrichment), or in some instances generality of a GO 
term (see information content above). This visualization method 
is implemented in GOSummaries and REVIGO (Fig. 4c). 


5. Clustered Heatmaps. Two-dimensional grids of values, wherein 
the rows and/or columns are clustered to reveal the ‘block 
structure’ in the data. Clustered heatmaps are often used for 
showing high-dimensional data in biology, but rarely so for 
GO terms. In fact, this could be done to show the GO terms’ 
similarity based on what genes are annotated to them, or on 
the terms’ semantic similarity (which is defined by the struc- 
ture of the GO graph). An example implementation can be 
found in RedundancyMiner (Fig. 5). 
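
Two of the quantities mentioned in the list above, the information content of a term and the SimRel similarity between two terms, can be written down compactly. The sketch below is a minimal illustration: the SimRel expression follows our reading of the published definition [33] (the best common ancestor c maximizes 2 log p(c) / (log p(t1) + log p(t2)) multiplied by 1 - p(c)), natural logarithms are an arbitrary choice since the text does not fix a base, and the term labels, relative frequencies, and ancestor sets are hypothetical toy values.

```python
from math import log


def information_content(term, freq):
    """IC = -log(relative annotation frequency of the term)."""
    return -log(freq[term])


def sim_rel(t1, t2, freq, ancestors):
    """SimRel-style similarity between two terms (0 = unrelated)."""
    # Common ancestors, counting each term as its own ancestor.
    common = (ancestors[t1] | {t1}) & (ancestors[t2] | {t2})
    denominator = log(freq[t1]) + log(freq[t2])
    if denominator == 0:
        return 0.0  # both terms have frequency 1 (e.g., the root)
    best = 0.0
    for c in common:
        score = (2 * log(freq[c]) / denominator) * (1 - freq[c])
        best = max(best, score)
    return best


# Toy ontology: B and C are children of A; A and D are children of root.
ANCESTORS = {
    "root": set(),
    "A": {"root"},
    "B": {"A", "root"},
    "C": {"A", "root"},
    "D": {"root"},
}
# Assumed relative annotation frequencies (the root covers everything).
FREQ = {"root": 1.0, "A": 0.4, "B": 0.1, "C": 0.05, "D": 0.2}

print(information_content("B", FREQ))
print(sim_rel("B", "C", FREQ, ANCESTORS))
print(sim_rel("B", "D", FREQ, ANCESTORS))
```

Terms that only share the root, such as B and D here, receive a similarity of zero, whereas siblings under a reasonably specific parent receive an intermediate value. A matrix of such pairwise values is the kind of input that a projection technique or a clustered heatmap would then display.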


In addition to the above, many of the tools specializing in GO 
enrichment testing (or in other analyses of large-scale biological 
data) often come bundled with visualizations that include GO as 
an important context. Examples include the Bioconductor pack- 
ages GOexpress, GOfunction and GOSim. In addition, it is often 
possible to customize such displays in more detail by manually 
passing the GO data to dedicated visualization software, such as 
the ggplot2 package [35] in R, or gnuplot. For example, 
specialized software for drawing treemaps can be made to display GO 
enrichments from a biological experiment via a script that prepares 
the data in the correct format [36]. REVIGO will draw bubble charts 
where the GO terms are displayed in a semantic similarity space 
[20], and it can export a ggplot2 script which is further customiz- 
able, e.g., for font sizes, colors, and line styles; it can similarly export 
a graph to be further customized in Cytoscape. 


6 Concluding Remarks and Outlook 


In summary, we outline several tools that biologists can use to visual- 
ize sets of Gene Ontology terms and uncover novel and interesting 
trends in their experimental data. We anticipate that the future will 
bring even more massive biological data sets, which will have several 
consequences. First, the lists of interesting GO terms will grow in 
length, as larger sample sizes afford more statistical power to detect 
associations. Therefore, refinements of the existing approaches that 
address redundant GO terms [20, 21] will come in useful. Second, 
the visualization software will need to deal with more than a single 
list of enriched GO terms. While some current tools can display such 
results from multiple experiments side-by-side, e.g. BACA [26], 
tools will be needed that can integrate such lists and extract patterns 
across them. Finally, while GO is a prominent example of an ontol- 
ogy used by biologists, it is far from the only one [37]: over 100 
biomedical ontologies exist that describe environments, phenotypes, 
and chemical entities (see Chap. 19) [38]. We foresee substantial 
developments in the tools that can summarize and visualize results of 
various biological experiments in the context of such emerging 
ontologies. 
A Gene Ontology Tutorial in Python 


Alex Warwick Vesztrocy and Christophe Dessimoz 


Abstract 


This chapter is a tutorial on using Gene Ontology resources in the Python programming language. 
This entails querying the Gene Ontology graph, retrieving Gene Ontology annotations, performing gene 
enrichment analyses, and computing basic semantic similarity between GO terms. An interactive version of 
the tutorial, including solutions, is available at http://gohandbook.org. 


Key words Gene Ontology, Tutorial, Python 


1 Introduction 


One of the main goals of developing a formal ontology is to facili- 
tate computational analysis. The purpose of this chapter is to pro- 
vide a hands-on introduction to handling GO terms and GO 
annotations in Python. This tutorial also shows how Python can be 
used to perform GO term enrichment analyses, as well as how to 
compute the similarity between GO terms. 

This tutorial uses Python, but other popular languages com- 
monly used to perform GO analyses include Java, R, Perl, and 
Matlab. The Gene Ontology consortium website maintains a list of 
software libraries, accessible from 


ftp://ftp.geneontology.org/pub/go/www/GO.tools_by_type.software.shtml 


An interactive version of this tutorial, with model solutions to 
all the questions, is available from the book homepage at http:// 
gohandbook.org. 


2 Querying the Gene Ontology 


A fundamental first step is to retrieve the Gene Ontology and anal- 
yse that structure (Chap. 3 [1]). 
One convenient Python package available to query the GO is 
GOATOOLS [2]. This package can read the GO structure stored 
in OBO format, which is available from the GO website (see 
Chap. 11 [3]). After loading this file, it is possible to traverse the 
GO structure, search for particular GO terms, and find out which 
other terms they are related to and how. 
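
To make the OBO format more concrete, the following is a minimal sketch of how [Term] stanzas map onto parent links. It is not the GOATOOLS parser: the function parse_obo() and the three-term excerpt are hand-written for illustration, the excerpt is heavily simplified, and only the id, name, namespace, and is_a tags are handled.

```python
# Hand-written, simplified excerpt in OBO style (an illustration, not the real file).
OBO_TEXT = """\
[Term]
id: GO:0008150
name: biological_process
namespace: biological_process

[Term]
id: GO:0032502
name: developmental process
namespace: biological_process
is_a: GO:0008150 ! biological_process

[Term]
id: GO:0048856
name: anatomical structure development
namespace: biological_process
is_a: GO:0032502 ! developmental process
"""


def parse_obo(text):
    """Parse [Term] stanzas into {id: {"name", "namespace", "is_a"}}."""
    terms = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = {"name": None, "namespace": None, "is_a": []}
        elif not line or line.startswith("["):
            current = None  # blank line or another stanza type ends the term
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            value = value.split(" ! ")[0]  # drop the trailing "! comment"
            if tag == "id":
                terms[value] = current
            elif tag == "is_a":
                current["is_a"].append(value)
            elif tag in ("name", "namespace"):
                current[tag] = value
    return terms


terms = parse_obo(OBO_TEXT)
print(terms["GO:0048856"]["name"], "->", terms["GO:0048856"]["is_a"])
```

For real analyses the full go-basic.obo file contains many more tags and stanza types, which is one reason to prefer an established parser such as GOATOOLS over a hand-rolled one.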

This package is available on the Python Package Index (PyPI), 
a standard repository of Python libraries. As such, it is possible to 
install it locally using the command (footnote 1): 


pip install goatools 


The GOATOOLS package contains the functions necessary to 
parse the GO in OBO format, to query it, and to visualise the 
ontology. Using the function obo_parser.GODag() from 
GOATOOLS, the GO file can be loaded. Each GO term in the 
resulting object is an instance of the GOTerm class, which contains 
many useful attributes, such as: 


• GOTerm.name: textual name of the term; 

• GOTerm.namespace: the ontology the term belongs to (i.e., 
Molecular Function [MF], Biological Process [BP], or Cellular 
Component [CC]); 

• GOTerm.parents: list of parent terms; 

• GOTerm.children: list of children terms; 

• GOTerm.level: shortest distance to the root node. 
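
The attributes listed above can be mimicked on a toy graph to see how they relate to one another. The sketch below uses a hypothetical Term class and made-up identifiers (T:0 to T:3); it is not the GOATOOLS GOTerm class, and the helper all_parents() is an illustrative name only.

```python
class Term:
    """Toy stand-in for a GO term; not the GOATOOLS GOTerm class."""

    def __init__(self, term_id, name):
        self.id = term_id
        self.name = name
        self.parents = []
        self.children = []

    def add_parent(self, parent):
        """Link this term to a parent and register it as that parent's child."""
        self.parents.append(parent)
        parent.children.append(self)

    @property
    def level(self):
        """Shortest distance (in edges) to a root term."""
        if not self.parents:
            return 0
        return 1 + min(p.level for p in self.parents)

    def all_parents(self):
        """IDs of every ancestor, found by recursing over the parents."""
        found = set()
        for p in self.parents:
            found.add(p.id)
            found |= p.all_parents()
        return found


# Made-up graph: root <- a <- b <- c, plus a direct edge root <- c.
root = Term("T:0", "root")
a = Term("T:1", "a")
b = Term("T:2", "b")
c = Term("T:3", "c")
a.add_parent(root)
b.add_parent(a)
c.add_parent(b)
c.add_parent(root)

print(c.level, sorted(c.all_parents()))
```

Because c has a direct edge to the root, its level is 1 even though a longer path through b and a also exists.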


Exercise 2.1 

Download the GO basic file in OBO format (go-basic.obo), and 
load the GO using the function obo_parser.GODag() from 
GOATOOLS. Using this library, answer the following questions: 

a) What is the name of the GO term GO:0048527? 

b) What are the immediate parent(s) of the term GO:0048527? 

c) What are the immediate children of the term GO:0048527? 

d) Recursively find all the parent and child terms of the term 
GO:0048527. Hint: use your solutions to the previous two questions, 
with a recursive loop. 

e) How many GO terms have the word “growth” in their name? 

f) What is the deepest common ancestor term of GO:0048527 and 
GO:0097178? 

g) Which GO terms regulate GO:0007124 (pseudohyphal growth)? 
Hint: load the relationship tags and look for terms which define 
regulation. 


1 GOATOOLS version 0.6.4 was used to write this tutorial and the exercises. 
To install this exact version, use pip install goatools==0.6.4 
[Figure 1: GO lineage graph. Node labels recoverable from the figure: 
GO:0008150 biological_process; GO:0044699 single-organism process; 
GO:0009987 cellular process; GO:0032502 developmental process; 
GO:0044763 single-organism cellular process; GO:0044767 single-organism 
developmental process; GO:0048856 anatomical structure development; 
GO:0048869 cellular developmental process.] 

Fig. 1 Selected parts of the Gene Ontology can be visualised using the GOATOOLS library [2] 


Exercise 2.2 
Using the visualisation function in the GOATOOLS library, answer the 
following questions: 


(a) Produce a figure similar to that in Fig. 1, for the GO term 
GO:0097190. From the visualisation, what is the name of this term? 

(b) Using this figure, what is the most specific term that is in the parent 
terms of both GO:0097191 (extrinsic apoptotic signalling pathway) 
and GO:0038034 (signal transduction in absence of ligand)? This is 
also referred to as the lowest common ancestor (see Chap. 12 [4]). 


Furthermore, other tag-value lines such as the “relationships” 
can be loaded with an optional argument of, e.g., 
optional_attrs=['relationship']. 

The GOATOOLS library also includes functions to visualise the 
GO graph. For instance, it is possible to depict the location of a 
particular GO term in the ontology using the method 
GOTerm.draw_lineage(). For example, the plot in Fig. 1 showing the 
lineage of the GO term GO:0048527 was created using this function. 

As an alternative to GOATOOLS and OBO files, it is possible 
to retrieve information relating to a specific term from a web ser- 
vice. One such service is the EMBL-EBI QuickGO resource (see 
Chap. 11 [3, 5]), which can provide descriptive information about 
GO terms in OBO-XML format. It is possible to request this 
OBO-XML file over HTTP, using a URL of the form 


http://www.ebi.ac.uk/QuickGO/GTerm?id=<GO_ID>&format=oboxml 


where <GO_ID> is replaced with the GO identifier for the term of 
interest. In Source Code 2.1, an example function to automate this 
in Python is listed, which uses the urllib library to request the OBO- 
XML and the xmltodict library to parse the XML into an easy to use 
dictionary structure. Both libraries are available to install using pip, 
if required. Note that the future library was used to ensure that the 
function is both Python 2 and 3 compatible. 
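
The step of turning XML into a nested dictionary can be tried offline before calling any web service. The sketch below uses only the standard library (xml.etree.ElementTree) as a stand-in for xmltodict, and parses a made-up fragment. The element names and the layout of SAMPLE_XML are assumptions for illustration, not a verbatim QuickGO response, and element_to_dict() is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up fragment; element names and nesting are assumptions for illustration.
SAMPLE_XML = """
<obo>
  <header>
    <format-version>1.2</format-version>
  </header>
  <term>
    <id>GO:0048527</id>
    <name>lateral root development</name>
    <namespace>biological_process</namespace>
    <def><defstr>placeholder definition text</defstr></def>
    <synonym><synonym_text>placeholder synonym A</synonym_text></synonym>
    <synonym><synonym_text>placeholder synonym B</synonym_text></synonym>
  </term>
</obo>
"""


def element_to_dict(elem):
    """Convert an element to nested dicts; repeated child tags become lists."""
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


root = ET.fromstring(SAMPLE_XML)
obodict = {root.tag: element_to_dict(root)}
term = obodict["obo"]["term"]
print(term["id"], term["name"], len(term["synonym"]))
```

As with the real service, the shape of the result depends on the input: a tag that occurs once yields a single value, whereas a repeated tag yields a list, so code that consumes the dictionary should handle both cases.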

The dictionary structure that is returned can vary based on what 
information is available in the database. One example of an informa- 
tion-rich term is GO:0043065. A visualisation of the dictionary 


[Figure 2: tree diagram of dictionary keys. Legible key labels include 
format-version, default-namespace, remark, id, namespace, def, and synonym.] 


Fig. 2 Visualisation of the keys in the hierarchical dictionary structure returned by 
get_oboxml('GO:0043065') 


Source Code 2.1. get_oboxml() function for Python 2 and 3. 
from future.standard_library import install_aliases 
install_aliases() 

from urllib.request import urlopen 
import xmltodict 


def get_oboxml(go_id): 
    """ 
    This function retrieves the OBO-XML for a 
    given Gene Ontology term, using EMBL-EBI's 
    QuickGO browser. 
    Input: go_id - a valid Gene Ontology ID, 
    e.g. GO:0048527. 
    """ 
    quickgo_url = ("http://ebi.ac.uk/QuickGO/GTerm?id=" + 
                   go_id + "&format=oboxml") 
    oboxml = urlopen(quickgo_url) 

    # Check the response 
    if oboxml.getcode() == 200: 
        obodict = xmltodict.parse(oboxml.read()) 
        return obodict 
    else: 
        raise ValueError("Couldn't receive OBOXML " 
                         "from QuickGO. Check URL and try again.") 
structure for this term, created with the visualisedictionary 
package available from PyPI (using pip), has been included in Fig. 2. 
The main advantage of using a web service, such as QuickGO, 
is that there is no requirement to download and parse the entire 
Gene Ontology structure; only the information required is retrieved. 
This is therefore more efficient if only a few particular terms are 
involved in an analysis. By contrast, for analyses involving many 
terms, the file-based approach described above is more suitable. 


Exercise 2.3 
Using the function get_oboxml(), listed in Source Code 2.1, answer 
the following questions: 


(a) Find the name and description of the GO term GO:0048527 (lateral 
root development). Hint: print out the dictionary returned by 
the function and study its structure, or use the visualisation in Fig. 2. 


(b) Look at the difference in the OBO-XML output for the GO terms 
GO:0048527 (lateral root development) and GO:0097178 
(ruffle assembly), then generate a table of the synonymous 
relationships of the term GO:0097178. 


3 Retrieving GO Annotations 


This section looks at manipulating the Gene Association File 
(GAF) standard, using a parser from the BioPython package [6]. 

Firstly, a GAF file, which contains GO annotations, shall be 
downloaded from the UniProt-GOA database [7]. Their website 
(https://www.ebi.ac.uk/GOA/downloads) lists a number of vari- 
ants. For this tutorial the reduced GAF file containing only the gene 
association data for Arabidopsis thaliana is going to be used. 

Annotations from GAF files can be loaded into a Python diction- 
ary using an iterator from the BioPython package (Bio.UniProt. 
GOA.gafiterator). Source Code 3.1 shows a simple example 
of this being used, in order to print out the protein ID for each 
annotation. 


Source Code 3.1 
from Bio.UniProt.GOA import gafiterator 
import gzip 

# filename = <LOCATION OF GAF FILE> 
filename = 'gene_association.goa_arabidopsis.gz' 

with gzip.open(filename, 'rt') as fp: 
    for annotation in gafiterator(fp): 
        # Output annotated protein ID 
        print(annotation['DB_Object_ID']) 
Recall that the latest GAF standard, version 2.1, has 17 tab-delimited 
fields, which are described in detail in Chap. 3 [1]. Some 
of them include: 


• 'DB': the protein database; 

• 'DB_Object_ID': protein ID; 

• 'Qualifier': annotation qualifier (such as NOT); 

• 'GO_ID': GO term; 

• 'Evidence': evidence code. 
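
To show how the fields above line up with the columns of a GAF line, here is a minimal standard-library sketch that splits tab-delimited lines by hand. It is not the BioPython parser: parse_gaf_line() and make_line() are hypothetical helpers, the protein IDs PROT1 and PROT2 and their annotations are made up, and, unlike a full parser, every field is kept as a plain string.

```python
from collections import Counter, defaultdict

# The 17 GAF columns, using the same field names as in Source Code 3.1.
GAF_FIELDS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_ID",
    "DB:Reference", "Evidence", "With", "Aspect", "DB_Object_Name",
    "Synonym", "DB_Object_Type", "Taxon_ID", "Date", "Assigned_By",
    "Annotation_Extension", "Gene_Product_Form_ID",
]


def parse_gaf_line(line):
    """Split one tab-delimited GAF line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GAF_FIELDS):
        raise ValueError("expected %d fields, got %d"
                         % (len(GAF_FIELDS), len(values)))
    return dict(zip(GAF_FIELDS, values))


def make_line(protein_id, qualifier, go_id, evidence):
    """Build a made-up GAF line; all other columns are left empty."""
    row = dict.fromkeys(GAF_FIELDS, "")
    row.update({"DB": "UniProtKB", "DB_Object_ID": protein_id,
                "Qualifier": qualifier, "GO_ID": go_id, "Evidence": evidence})
    return "\t".join(row[field] for field in GAF_FIELDS)


lines = [
    make_line("PROT1", "", "GO:0048527", "IMP"),
    make_line("PROT1", "", "GO:0007124", "IEA"),
    make_line("PROT2", "NOT", "GO:0048527", "IDA"),
]
records = [parse_gaf_line(line) for line in lines]

# Count negated annotations and tally evidence codes.
not_count = sum("NOT" in r["Qualifier"].split("|") for r in records)
evidence_counts = Counter(r["Evidence"] for r in records)

# Protein ID -> set of GO IDs, skipping NOT-qualified annotations.
associations = defaultdict(set)
for r in records:
    if "NOT" not in r["Qualifier"].split("|"):
        associations[r["DB_Object_ID"]].add(r["GO_ID"])

print(not_count, dict(evidence_counts), dict(associations))
```

The associations dictionary has the shape (protein IDs as keys, sets of GO terms as values) that is convenient for enrichment analyses; whether to exclude NOT-qualified annotations is a design choice made here for illustration.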


Exercise 3.1 


a) Find the total number of annotations for Arabidopsis thaliana with 
NOT qualifiers. What is this as a percentage of the total number of 
annotations for this species? 


b) How many genes (of Arabidopsis thaliana) have the annotation 
GO:0048527 (lateral root development)? 


c) Generate a list of annotated proteins which have the word “growth” 
in their name. 


d) There are 21 evidence codes used in the Gene Ontology project. 
As discussed in Chap. 3 [1], many of these are inferred, either by 
curators or automatically. Find the counts of each evidence code in 
the Arabidopsis thaliana annotation file. 


4 GO Enrichment or Depletion Analysis 


As discussed in detail in Chap. 13 [8], one of the most common 
analyses performed on GO data is an enrichment (or depletion) 
analysis. In this tutorial, the GOEnrichmentStudy() function 
available in the GOATOOLS library (introduced in Sect. 2) will be 
used. 

The GOEnrichmentStudy() function requires the follow- 
ing arguments: 


1. the background set of genes (also known as the “population 
set”), passed as a list of gene or protein IDs; 


2. associations between proteins IDs and GO term IDs, passed as 
a dictionary with protein IDs as the keys and sets of associated 
GO terms as the values; 


3. the Gene Ontology structure, i.e., the output of the 
obo_parser.GODag() function from GOATOOLS; 


4. whether annotations should be propagated to all parent terms 
(defined in terms of is_a tags only), indicated by setting the 
optional boolean parameter propagate_counts to True 
(default) or False; 
5. the significance level, indicated by setting the optional parameter 
alpha to the desired cut-off (default: 0.05); 


6. the foreground set of genes (also known as the “study set”), 
indicated by setting the parameter study to a list of gene or protein IDs; 


7. the list of method(s) to be used to assess significance, indicated 
by setting the parameter methods to a list containing one or 
several of these elements: 


(a) "bonferroni": Fisher’s exact test with Bonferroni cor- 
rection for multiple testing; 


(b) "sidak": Fisher’s exact test with Sidak correction for mul- 
tiple testing; 


(c) "holm": Fisher’s exact test with Holm-Bonferroni correc- 
tion for multiple testing; 


(d) "fdr": Fisher’s exact test, controlling the false discovery 
rate (see Chap. 13 [8]). 
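
The multiple-testing corrections named in the list above can be sketched in a few lines of plain Python. These are textbook formulations written for illustration, not the GOATOOLS implementations; in particular, the Benjamini-Hochberg procedure is used here as one common way of controlling the false discovery rate, and it is an assumption that it behaves like the "fdr" option.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def sidak(pvals):
    """Sidak adjustment: 1 - (1 - p)^m."""
    m = len(pvals)
    return [1.0 - (1.0 - p) ** m for p in pvals]


def holm(pvals):
    """Holm-Bonferroni step-down adjustment."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # ascending p-values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        value = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, value)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (often reported as q-values)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(order):
        rank = m - k  # 1-based rank in ascending order
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(pvals))
print(holm(pvals))
print(benjamini_hochberg(pvals))
```

On the same input, Bonferroni is the most conservative of these adjustments, Holm is never more conservative than Bonferroni, and the false discovery rate adjustment is typically the most permissive.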


The function returns the list of over-represented and under- 
represented GO terms in the study set, compared to the 
background set. 
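
The test underlying such an analysis can be illustrated on toy data. The sketch below computes a one-sided (over-representation only) Fisher's exact test via the hypergeometric distribution, which is a simplification: it does not test for depletion and is not the GOATOOLS code. The gene names g1 to g10 and the term labels T:A and T:B are made up, and the code requires Python 3.8 or later for math.comb.

```python
from math import comb


def enrichment_pvalue(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return tail / total


def term_counts(genes, assoc):
    """Number of genes in `genes` annotated with each term."""
    counts = {}
    for gene in genes:
        for term in assoc.get(gene, ()):
            counts[term] = counts.get(term, 0) + 1
    return counts


# Made-up population of 10 genes and a study set of 3 genes.
population = ["g%d" % i for i in range(1, 11)]
study = ["g1", "g2", "g3"]
assoc = {
    "g1": {"T:A"}, "g2": {"T:A"}, "g3": {"T:A", "T:B"}, "g4": {"T:A"},
    "g5": {"T:B"}, "g6": {"T:B"}, "g7": {"T:B"}, "g8": {"T:B"}, "g9": {"T:B"},
}

pop_counts = term_counts(population, assoc)
study_counts = term_counts(study, assoc)
results = {
    term: enrichment_pvalue(study_counts[term], len(study),
                            pop_counts[term], len(population))
    for term in study_counts
}
for term in sorted(results):
    print(term, round(results[term], 4))
```

Here T:A annotates all three study genes but only four genes in the population, so it receives a small p-value, whereas T:B is common in the population and is not enriched. In practice these raw p-values would still need to be corrected for multiple testing.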


Exercise 4.1 
Perform an enrichment analysis using the list of genes with the “growth” 
keyword from exercise 3.1.c. Use the Arabidopsis thaliana annotation 
set as background, also from exercise 3.1, and the GO structure from 
exercise 2.1. 


(a) Which GO term is most significantly enriched or depleted? Does 
this make sense? 


(b) How many terms are enriched, when using the Bonferroni cor- 
rected p-value < 0.01? 


(c) How many terms are enriched, when using the false discovery rate 
(a.k.a. q-value) < 0.01? 


5 Computing Basic Semantic Similarities Between GO Terms 


In this section, the focus is on computing semantic similarity 
between GO terms, based on ideas presented in detail in Chap. 12 
[4]. Semantic similarity measures enable us to quantify the func- 
tional similarity of genes annotated with GO terms. 

Recall that semantic similarity measures are broadly separated 
in two categories: graph-based and information-theoretic measures. 
The former relies only on the structure of the Gene Ontology 
graph, whilst the latter also accounts for the information content 
of the terms. 

One graph-based measure of semantic similarity, presented in 
Chap. 12 [4], is the inverse of the number of edges separating two 
terms. It is possible to compute the minimum number of edges 
separating two terms ($t_1$, $t_2$) by first finding the deepest common 
ancestor ($t_{\mathrm{DCA}}$). Then the difference in depth between each term 
and the deepest common ancestor can be used to calculate the 
minimum distance between the terms, i.e., 


$$\mathrm{min\_distance}(t_1, t_2) = \mathrm{depth}(t_1) + \mathrm{depth}(t_2) - 2 \times \mathrm{depth}(t_{\mathrm{DCA}})$$ 
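
The distance formula can be checked on a small made-up tree. In the sketch below the term names (root, a, b, a1, a2, a2x) are hypothetical, and depth is taken as the length of the longest path to the root, which is an assumption; on a tree such as this one every path to the root has the same length, so the choice does not matter here.

```python
# Made-up tree: each term maps to the list of its parents.
parents = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "a1": ["a"],
    "a2": ["a"],
    "a2x": ["a2"],
}


def ancestors(term, parents):
    """All ancestors of a term, found recursively."""
    found = set()
    for p in parents[term]:
        found.add(p)
        found |= ancestors(p, parents)
    return found


def depth(term, parents):
    """Length of the longest path from the term to the root."""
    if not parents[term]:
        return 0
    return 1 + max(depth(p, parents) for p in parents[term])


def deepest_common_ancestor(t1, t2, parents):
    """Common ancestor (a term counts as its own ancestor) of maximal depth."""
    common = (ancestors(t1, parents) | {t1}) & (ancestors(t2, parents) | {t2})
    return max(common, key=lambda t: depth(t, parents))


def min_distance(t1, t2, parents):
    """depth(t1) + depth(t2) - 2 * depth(deepest common ancestor)."""
    dca = deepest_common_ancestor(t1, t2, parents)
    return depth(t1, parents) + depth(t2, parents) - 2 * depth(dca, parents)


d = min_distance("a1", "a2x", parents)
print(deepest_common_ancestor("a1", "a2x", parents), d, 1.0 / d)
```

The inverse of the distance then serves as the similarity score; note that it is undefined when both terms are identical (distance zero), a case that real code would need to handle explicitly.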


Further, one example of an information-theoretic measure (see 
Chap. 12 [4]) is Resnik’s similarity measure—the information con- 
tent of the most informative common ancestor of the two terms in 
question. The information content of a term is defined as the nega- 
tive logarithm of its probability, which can be estimated from the 
frequency of the term in the annotation database of choice. 
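
As a worked illustration of these definitions, the sketch below computes information content and Resnik similarity on a made-up tree with made-up annotation counts. The term names, the counts, and the use of the natural logarithm are all assumptions for the example; annotation counts are propagated to every ancestor before frequencies are computed.

```python
import math

# Made-up tree (term -> parents) and made-up direct annotation counts.
parents = {"root": [], "a": ["root"], "b": ["root"], "a1": ["a"], "a2": ["a"]}
direct_counts = {"root": 0, "a": 2, "b": 4, "a1": 1, "a2": 1}


def ancestors(term):
    """All ancestors of a term, found recursively."""
    found = set()
    for p in parents[term]:
        found.add(p)
        found |= ancestors(p)
    return found


# Propagate counts: an annotation to a term also counts for its ancestors.
freq = dict.fromkeys(parents, 0)
for term, count in direct_counts.items():
    for t in ancestors(term) | {term}:
        freq[t] += count
total = freq["root"]


def information_content(term):
    """Negative (natural) logarithm of the term's relative frequency."""
    return -math.log(freq[term] / total)


def resnik(t1, t2):
    """Information content of the most informative common ancestor."""
    common = (ancestors(t1) | {t1}) & (ancestors(t2) | {t2})
    return max(information_content(t) for t in common)


print(round(information_content("a1"), 3), round(resnik("a1", "a2"), 3))
```

In this example the term a accounts for half of all annotations, so two of its children share an information content of ln 2, whereas terms whose only common ancestor is the root have a Resnik similarity of zero.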


Exercise 5.1 


(a) GO:0048364 (root development) and GO:0044707 (single- 
multicellular organism process) are two GO terms taken from Fig. 1. 
Calculate the semantic similarity between them based on the inverse 
of the semantic distance (number of branches separating them). 

(b) Calculate the information content (IC) of the GO term 
GO:0048364 (root development), based on the frequency of 
observation in Arabidopsis thaliana. 


(c) Calculate the Resnik similarity measure between the same two 
terms as in part a. 
Abstract 


The specificity of knowledge that Gene Ontology (GO) annotations currently can represent is still restricted 
by the legacy format of the GO annotation file, a format intentionally designed for simplicity to keep the 
barriers to entry low and thus encourage initial adoption. Historically, the information that could be captured 
in a GO annotation was simply the role or location of a gene product, although genetically interacting or 
binding partners could be specified. While there was no mechanism within the original GO annotation 
format for capturing additional information about the context of a GO term, such as the target gene of an 
activity or the location of a molecular function, the long-term vision for the GO Consortium was to 
provide greater expressivity in its annotations to capture physiologically relevant information. 

Thus, as a step forwards, the GO Consortium has introduced a new field into the annotation for- 
mat, annotation extensions, which can be used to capture valuable contextual detail. This provides exper- 
imentally verified links between gene products and other physiological information that is crucial for 
accurate analysis of pathway and network data. This chapter will provide a simple overview of annotation 
extensions, illustrated with examples of their usage, and explain why they are useful for scientists and 
bioinformaticians alike. 


Key words Gene Ontology, Annotation, Biocuration, Context, Pathway, Network, Analysis, 
Annotation extension 


1 Introduction 


Functional annotation of gene products using the GO has gone far 
in simplifying the task of finding functional roles of both individual 
and groups of gene products. It has enabled a multitude of analyses 
that were previously not possible. For example, GO annotations 
are invaluable for analyzing a list of genes that are identified as 
differentially expressed in a microarray experiment using one of 
the many freely available functional enrichment programs [1, 2] 
(see also Chap. 13 [3]). 

The original simplistic GO annotation pairs a gene product 
with a GO term (one of biological process, molecular function or 
cellular component). Because these pair-wise associations are treated 
independently, vast amounts of correlated functional data are 
omitted from the basic GO annotation and therefore inaccessible 
to network and pathway analyses. This contextual information is 
essential for understanding the physiological roles of gene prod- 
ucts. Without contextual information bioinformatics analyses can- 
not identify gene products that perform a role only under certain 
conditions or in the presence of specific factors and therefore will 
present an incomplete view of the available data [4]. Specific gene 
products will often have different biological roles in different cells 
or tissues as these roles will be dependent on the available interact- 
ing partners; already tissue-specific network analyses are able to 
demonstrate the importance of the cellular environment. For 
example, Greene et al. [5] analyzed the GO and pathway annota- 
tions of the available interaction partners of the transcription factor, 
LEF1, in different tissue types. They demonstrated that LEF1 was 
significantly associated with biological processes that were relevant 
to each tissue type. For instance, in blood vessels the LEF1 interacting 
partners were associated with angiogenesis, whereas in hypothalamus 
they were associated with hypothalamus development. 
Here we describe an incremental extension of the GO annota- 
tion format to allow more detailed statements about gene product 
function, which will benefit all types of functional analyses [5]. 


2 Extending the Core GO Annotation Model 


In practical terms, the newly introduced annotation extensions field 
enables curators to provide appropriate experimentally evidenced 
contextual information for manually curated annotations (extant 
software pipelines for electronically inferred annotations (IEA) do 
not yet support population of this field). 

Generating a comprehensive annotation, one that includes its 
context, involves refining the core pair-wise association with addi- 
tional relationships to other ontology classes [5]. This dynamic 
approach is logically equivalent to creating a new term for the sub- 
type in the ontology, but offers advantages in terms of both flexi- 
bility and efficiency. 

In essence this approach allows curators to dynamically create 
“virtual” terms. It enables curators to combine all of the specific 
terms needed to fully describe a gene product in a way that can be 
reproducibly, computationally interpreted. For example, “core 
RNA polymerase binding transcription factor in hypothalamus”, 
associates a gene product with that activity occurring in that spe- 
cific location. From the computer logic perspective this effectively 
has created a subclass of “core RNA polymerase binding transcrip- 
tion factor activity” (GO:0000990). The flexibility of expression 
thus supports the virtual creation of complex, compound child 
terms on an as-needed basis. Additionally, this approach to virtual 
term creation is immediate. Because the parent can be automati- 
cally inferred from the primary term of the association, and because 
the additional relationships to other terms provide the refinements 
needed to create a more specific term, the result is that the previ- 
ously independent processes of annotating gene products and cre- 
ating ontology terms are now fully integrated. The use of annotation 
extensions means that curators can immediately make the biologi- 
cal statement required without having to return to the annotation 
to update it only after the term is available in the ontology, thus 
making the overall process more efficient. As these virtual terms are 
not consequently added to the ontology—although they could be 
if required—the extended annotations can be “folded” to create 
the logical equivalent of a GO term [5]. The GO Consortium 
(GOC) is in the process of incorporating these inferred annota- 
tions into the files it provides and so this contextual information 
will be included by default for use by anyone, or any analysis tool, 
that utilizes the annotation files. 


3 Annotation Extension Format 


Annotation extensions refine the GO term used in the basic anno- 
tation by adding one or more relational expressions (extensions). 
Each extension is written as Relation(Entity), where Relation is a 
label describing the relationship between the GO term and the 
entity, and Entity is an identifier for a database object or ontology 
term, for example part_of(GO:0005634), where GO:0005634 is 
the Gene Ontology identifier for “nucleus”. 
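
The Relation(Entity) syntax is simple enough to parse with a regular expression. The sketch below is an illustration under stated assumptions: parse_extension() is a hypothetical helper, relation labels are assumed to consist of letters and underscores, and entity identifiers are assumed to contain no spaces or parentheses.

```python
import re

# Relation label, then an entity identifier in parentheses (assumed pattern).
EXTENSION_RE = re.compile(r"^\s*([A-Za-z_]+)\(([^()\s]+)\)\s*$")


def parse_extension(expr):
    """Split 'Relation(Entity)' into its relation, entity, and ID prefix."""
    match = EXTENSION_RE.match(expr)
    if match is None:
        raise ValueError("not a Relation(Entity) expression: %r" % expr)
    relation, entity = match.groups()
    db, _, local_id = entity.partition(":")
    return {"relation": relation, "entity": entity,
            "db": db, "local_id": local_id}


parsed = parse_extension("part_of(GO:0005634)")
print(parsed["relation"], parsed["db"], parsed["local_id"])
```

Splitting the entity at the first colon separates the database or ontology prefix from the local identifier, which is useful when checking which kind of entity a relation points to.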

Relations can be one of two types: “molecular relations” that 
are used with entities such as a gene, gene product, complex, or 
chemical, and “contextual relations” that are used with entities 
such as a cell type, anatomy term, developmental stage, or a 
GO term. 

In order to clearly define the semantics of the extensions, rules 
have been implemented defining what types of entity identifiers 
may be used with each relation. Generally, curators may only use 
contextual relations (e.g., where and when) with terms from the 
Cell Type Ontology (CL) [6], Uber Anatomy Ontology (Uberon) 
[7], Plant Ontology (PO) [8], nematode life stages (WBls) [9] and 
certain GO terms, and molecular target relations may only apply to 
a physical entity such as a gene product (e.g., UniProtKB [10] or 
PomBase [11]), a macromolecular complex (e.g., IntAct Complex 
Portal [12]), or a chemical using a ChEBI [13] identifier. Curation 
tools can incorporate these rules to prevent invalid annotations 
from being created. Table 1 shows the most commonly used relations 
with examples of their usage. 
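
A curation tool could encode such rules as a lookup from relation to permitted identifier prefixes. The sketch below is deliberately incomplete and illustrative only: the prefix spellings (for example UBERON and CHEBI), the inclusion of WBbt and Ensembl because they appear in this chapter's examples, and the helper is_allowed() are all assumptions, not the GO Consortium's actual rule set.

```python
# Relation -> kind, following the grouping used in this chapter.
RELATION_KIND = {
    "part_of": "contextual",
    "occurs_in": "contextual",
    "happens_during": "contextual",
    "has_regulation_target": "molecular",
    "has_input": "molecular",
    "has_direct_input": "molecular",
}

# Assumed identifier prefixes per kind (illustrative, not exhaustive).
ALLOWED_PREFIXES = {
    "contextual": {"CL", "UBERON", "PO", "WBls", "WBbt", "GO"},
    "molecular": {"UniProtKB", "PomBase", "CHEBI", "Ensembl"},
}


def is_allowed(relation, entity):
    """True if the entity's prefix is permitted for this relation."""
    kind = RELATION_KIND.get(relation)
    if kind is None:
        return False  # unknown relation
    prefix = entity.split(":", 1)[0]
    return prefix in ALLOWED_PREFIXES[kind]


print(is_allowed("occurs_in", "CL:0000740"))
print(is_allowed("occurs_in", "UniProtKB:P08151"))
```

Real rules are finer-grained than this, since they also depend on the aspect of the primary GO term, but the same table-driven approach extends naturally to such checks.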
Table 1 
Most commonly used relationships for annotation extension statements and examples of their usage 


Relation                 Example (gene product; primary GO term; annotation extension) 

Contextual relationships 
part_of                  C. elegans pst-1; nucleus; part_of(WBbt:0006804 body wall muscle cell) 
occurs_in                Mouse opsin-4; G-protein coupled photoreceptor activity; occurs_in(CL:0000740 retinal ganglion cell) 
happens_during           S. pombe wis4; stress-activated MAPK cascade; happens_during(GO:0071470 cellular response to osmotic stress) 

Molecular relationships 
has_regulation_target    Human suppressor of fused homolog SUFU; negative regulation of transcription factor import into nucleus; has_regulation_target(UniProtKB:P08151 zinc finger protein GLI1) 
has_input                S. pombe rif2; protein localization to nucleus; has_input(PomBase:SPAC26H5.0 pcf2) 
has_direct_input         Human WNK4; chloride channel inhibitor activity; has_direct_input(UniProtKB:Q7LBE3 Solute carrier family 26 member 9) 


Molecular relations take an entity such as a gene, gene product, complex, or chemical as an argument; 
contextual relations take an entity such as a cell type, anatomy term, development stage, or a GO 
term as an argument. Entity names in italics are shown for clarity and are not part of the annotation 
extension format. 


Reproduced from Huntley et al. [5]. Open access licence http://creativecommons.org/licenses/by/4.0/ 


4 Improved Expressiveness of GO Annotations: Examples 


4.1 Targets of an Enzyme 


One means of adding value to a GO annotation, using annotation 
extensions, is by specifying the molecular target of an enzyme 
activity. The inability to add effector-target relationships has been 
a major limitation of the core GO annotation model; with this 
addition we can now begin to provide directional information that 
can be used for network and pathway analyses. Take as an example 
the annotation of human mitogen-activated protein kinase- 
activated protein kinase 2 (MAPKAP-K2), which was shown to 
phosphorylate the CapZ-interacting protein (CapZIP) [14]. A 
basic GO annotation would describe MAPKAP-K2 as a protein 
serine/threonine kinase: 


Gene product: UniProtKB:P49137 (human MAPKAP-K2) 


GO term: GO:0004674 (protein serine/threonine kinase activity) 


Using an annotation extension, a curator can add more detail 
as follows: 


Gene product: UniProtKB:P49137 (human MAPKAP-K2) 
GO term: GO:0004674 (protein serine/threonine kinase activity) 
Extension: has_direct_input(UniProtKB:Q6JBY9) (human CapZIP) 


N.B. phrases in italics are not part of the syntax but are added 
for better interpretation by the reader. 

The extended GO annotation describes MAPKAP-K2 as a protein 
serine/threonine kinase that can phosphorylate CapZIP. This 
is vital information that can be utilized for linking together 
processes and pathways that MAPKAP-K2 and CapZIP, and any 
further targets of these proteins, are involved in. The rules of usage 
for has_direct_input are that the primary GO term used should be 
a Biological Process or Molecular Function (in this example it is a 
Molecular Function), and that the entity used in the extension 
should be a gene product, macromolecular complex, or chemical 
(in this example it is a gene product, i.e., a protein). Note that 
has_direct_input was used here instead of has_input because there 
was evidence in the paper that MAPKAP-K2 acted directly on the 
substrate CapZIP; if there had been a possibility of an intermediate 
molecule in this reaction, has_input would have been used. 


4.2 Anatomical Location of a Gene Product’s Function 


An annotation can be extended to specify the locational context in 
which a gene product performs its roles. It is important to note 
that we intend only to capture those locations that are physiologi- 
cally relevant to the organism and not the experimental detail in 
which the observation was made. 

The rat protein dihydrofolate reductase (Dhfr) was shown to 
reduce dihydrofolic acid to tetrahydrofolic acid in rat neurons [15]. 
From this evidence a basic GO annotation could be made as follows: 


Gene product: UniProtKB:Q920D2 (rat Dhfr) 
GO term: GO:0004146 (dihydrofolate reductase activity) 


By extending the annotation the curator can also specify in 
which cell type this activity occurs: 


Gene product: UniProtKB:Q920D2 (rat Dhfr) 
GO term: GO:0004146 (dihydrofolate reductase activity) 
Extension: occurs_in(CL:0000540) (neuron) 


This annotation now provides the physiologically relevant 
information that Dhfr is active in neurons. The rules for occurs_in 
are that the primary GO term used must be a Biological Process or 
Molecular Function (in this example it is a Molecular Function); 
additionally the entity in the extension must be a cell type, ana- 
tomical feature, or GO Cellular Component (in this example it is 
an identifier from the Cell Type Ontology). 


4.3 Timing-Specific Location of a Gene Product 

A gene product's annotation may be made more specific by including 
the appropriate developmental stage. An example is the location of 
the C. elegans PAXT-1 protein, which is located in the nucleus during 
the embryo stage [16]. Using the basic GO annotation format, a 
curator might indicate that PAXT-1 is located in the nucleus: 
Gene product: UniProtKB:Q21738 (C. elegans PAXT-1) 
GO term: GO:0005634 (nucleus) 


By extending the annotation the curator can also specify when 
this localization occurs: 


Gene product: UniProtKB:Q21738 (C. elegans PAXT-1) 
GO term: GO:0005634 (nucleus) 
Extension: exists_during(WBls:0000003) (embryo) 


This annotation means that PAXT-1 is located in the nucleus 
during the C. elegans embryo stage. The rules for exists_during are 
that the primary GO term used should be a Cellular Component 
and the entity in the extension should be a developmental stage 
or a GO Biological Process; in this case the entity is from the 
C. elegans life stage ontology. 


4.4 Multiple Relational Expressions 


If several contextual statements can be made for the gene product, 
it is possible to combine relational expressions to make even more 
complex statements. Relational expressions can be separated by 
commas “,” (meaning AND) or by pipes “|” (meaning OR), 
depending on whether the conditions in the statement are 
co-occurring (AND) or independent (OR). 
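
The comma and pipe conventions translate directly into two levels of splitting. The following sketch assumes that entity identifiers never contain commas or pipes; parse_extension_column() is a hypothetical helper, and the two input strings are built from identifiers used in this chapter's examples.

```python
def parse_extension_column(column):
    """Return a list of alternatives (OR); each is a list of expressions (AND)."""
    alternatives = []
    for group in column.split("|"):
        expressions = [e.strip() for e in group.split(",") if e.strip()]
        if expressions:
            alternatives.append(expressions)
    return alternatives


# Three co-occurring conditions (AND).
and_example = parse_extension_column(
    "has_direct_input(Ensembl:ENSG00000204531),"
    "occurs_in(CL:0002322),part_of(GO:1904676)"
)

# Two independent statements (OR).
or_example = parse_extension_column(
    "has_direct_input(UniProtKB:P12830)|has_direct_input(UniProtKB:O60716)"
)

print(len(and_example), len(and_example[0]))
print(len(or_example), [len(group) for group in or_example])
```

Representing the column as a list of lists keeps the two meanings apart: the outer list holds independent statements, and each inner list holds conditions that apply together.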

The human microRNA miR-145 provides an example of the 
application of multiple annotation extensions. MiR-145 was shown 
to directly bind and silence the POU5F1 transcription factor, 
among others, causing inhibition of embryonic stem cell division 
[17]. This evidence could therefore be represented by two basic 
GO annotations as follows: 


Gene product: RNACentral:URS0000527F89_9606 (human miR-145) 
GO term: GO:1903231 (mRNA binding involved in 
posttranscriptional gene silencing) 


Gene product: RNACentral:URS0000527F89_9606 (human miR-145) 
GO term: GO:1904676 (negative regulation of somatic stem cell 
division) 


Using relational expressions, separated by commas, we can 
make one extended annotation as follows: 


Gene product: RNACentral:URS0000527F89_9606 (human miR-145) 
GO term: GO:1903231 (mRNA binding involved in 
posttranscriptional gene silencing) 
Extension: has_direct_input(Ensembl:ENSG00000204531), 
occurs_in(CL:0002322), part_of(GO:1904676) 
(human POU5F1, embryonic stem cell, negative 
regulation of somatic stem cell division) 
The extended annotation signifies that miR-145 directly binds 
and silences POU5F1 mRNA expression as part of the inhibition 
of somatic stem cell division of embryonic stem cells. Again, this 
contextual information will be essential when analyzing the 
physiological relevance of the role of a gene product in a 
pathway. 

Although the use of a pipe (|) to indicate independent contex- 
tual statements does not provide any additional expressivity to the 
statements already made, it allows a curator to capture several 
statements from the same evidence within a paper. An example is 
when specifying the multiple substrates of an enzyme—the enzyme 
may act on each of the substrates independently, but not all at the 
same time; therefore, the substrates can be listed in the extension 
separated by pipe symbols: 


Gene product: UniProtKB:O14522 (human PTPRT) 
GO term: GO:0004725 (protein tyrosine phosphatase activity) 
Extension: has_direct_input(UniProtKB:P12830) | 
has_direct_input(UniProtKB:O60716) 
(E-cadherin | CTNND1) 


This annotation indicates that the receptor protein tyrosine 
phosphatase rho (PTPRT) dephosphorylates E-cadherin and 
CTNND1, but not necessarily both simultaneously. It would be 
equally correct to create two separate annotations each with a single 
substrate in the extension. 


5 Practical Use of Extended Annotations 


There are likely to be many use cases for extended annotations, 
including some we have not yet envisioned. Users will be able to 
perform more advanced queries with the available functional data, 
such as filtering on the subcellular, cellular or anatomical locations in 
which a gene product performs its roles, or finding which genes a 
transcription factor regulates in a specified cell type. Annotation 
extensions can also help create functional networks through the use of 
directional relationships such as has_input and has_direct_input, which 
allow specification of the target of an effector, for example in a 
signaling pathway, or the substrates of a metabolic enzyme activity. 
Without contextual detail, bioinformatics analyses of gene prod- 
ucts involved in a specified process cannot distinguish, for example, 
between those gene products that are active only in a particular cell 
type and those that are inactive or absent from that cell type, there- 
fore creating a bias in the interpretation of the data. With extended 
annotations any differences in the active components of a process or 
pathway between various cell or tissue types can be determined. 
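
As a rough sketch of the two use cases described above, the following code derives directed effector-to-target edges from directional relations and filters gene products by location. The record layout and the function names `build_edges` and `active_in` are hypothetical; the identifiers come from the examples earlier in this chapter. Matching here is exact and does not follow the ontology hierarchy.

```python
from collections import defaultdict

# Relations that point from an effector to its target.
DIRECTIONAL = {"has_input", "has_direct_input"}

# (gene product, GO term, [(relation, target), ...]); hypothetical layout.
annotations = [
    ("UniProtKB:O14522", "GO:0004725",
     [("has_direct_input", "UniProtKB:P12830")]),
    ("UniProtKB:O14522", "GO:0004725",
     [("has_direct_input", "UniProtKB:O60716")]),
    ("RNAcentral:URS0000527F89_9606", "GO:1903231",
     [("has_direct_input", "Ensembl:ENSG00000204531"),
      ("occurs_in", "CL:0002322"),
      ("part_of", "GO:1904676")]),
]


def build_edges(records):
    """Map each gene product to the set of targets it acts on."""
    edges = defaultdict(set)
    for product, _go_term, extension in records:
        for relation, target in extension:
            if relation in DIRECTIONAL:
                edges[product].add(target)
    return dict(edges)


def active_in(records, location_id):
    """Gene products with an occurs_in statement for exactly this location."""
    return sorted(
        {product for product, _go_term, extension in records
         if ("occurs_in", location_id) in extension}
    )


edges = build_edges(annotations)
in_stem_cells = active_in(annotations, "CL:0002322")
```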

5.1 Access 

Extended annotations are available for download in the current 
GO annotation files, both in the GAF 2.0 format (column 16; 
http://www.geneontology.org/GO.format.gaf-2_0.shtml) and in 
the Gene Product Association Data format (GPAD; column 11; 
http://www.geneontology.org/GO.format.gpad.shtml). These 
files can be accessed from the GOC website 
(http://geneontology.org/GO.downloads.annotations.shtml) and the GOA website 
(http://www.ebi.ac.uk/GOA/downloads). 
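
To illustrate where the extension sits in a downloaded file, here is a minimal sketch that pulls the extension field out of one tab-separated row, using the column numbers given above (16 for GAF 2.0, 11 for GPAD). The function name is hypothetical, and the example row is made up for illustration: its reference, evidence code, date, and assigning group are placeholders, not a real annotation record.

```python
from typing import Optional

# 0-based positions of the extension column (column 16 and column 11).
EXTENSION_INDEX = {"gaf": 15, "gpad": 10}


def extension_column(line: str, file_format: str = "gaf") -> Optional[str]:
    """Return the annotation extension field of one tab-separated row."""
    if not line.strip() or line.startswith("!"):
        return None  # header and comment lines carry no annotation
    fields = line.rstrip("\n").split("\t")
    index = EXTENSION_INDEX[file_format]
    return fields[index] if len(fields) > index else ""


# A made-up 17-column GAF-style row based on the PTPRT example.
example_row = "\t".join([
    "UniProtKB", "O14522", "PTPRT", "", "GO:0004725",
    "PMID:0000000",  # placeholder reference
    "IDA",           # placeholder evidence code
    "", "F", "", "", "protein", "taxon:9606",
    "20150101",      # placeholder date
    "EXAMPLE",       # placeholder assigning group
    "has_direct_input(UniProtKB:P12830)|has_direct_input(UniProtKB:O60716)",
    "",
])

extension = extension_column(example_row)
```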

Extended annotations can be accessed on the web via the GO 
browsers QuickGO ([18]; www.ebi.ac.uk/QuickGO-Beta) and 
AmiGO 2 ([19]; http://amigo.geneontology.org/amigo/). Both 
browsers allow users to filter annotation sets based on the contents of 
the annotation extension. The display of extended annotations may 
differ depending on the resource (Fig. 1), but the GO annotation 
files display the plain text extension, since this is better suited 
to computational analysis (see also Chap. 11 [20]). Any questions on 
how to access or use extended annotations should be directed to the 
GOC helpdesk (http://geneontology.org/form/contact-go). 


[Fig. 1 screenshots: (a) QuickGO and (b) AmiGO 2 rows for human LRRK2 annotated to "SNARE binding" (GO:0000149) and "glycoprotein binding" (GO:0001948), evidence IPI (ECO:0000353), PMID:21307259, assigned by ParkinsonsUK-UCL, with the extension occurs in(UBERON:0000955) displayed as "occurs in brain"; (c) PomBase display of extensions such as "has substrate" and "in presence of" under the GO Molecular Function term "ATP binding".] 

Fig. 1 Display of extended annotations in (a) the beta version of the EBI GO browser QuickGO, (b) AmiGO 2, and 
(c) PomBase (http://www.pombase.org/) 


5.2 Exercise 

The addition of extended annotations to Gene Ontology datasets 
enables users to perform sophisticated queries. This exercise will 
demonstrate how to build such a query in the GOC browser 
AmiGO 2, namely, to provide all of the gene products from S. pombe 
that are located in the spindle midzone during mitotic anaphase. 

1. Open the AmiGO 2 browser (http://amigo.geneontology.org/amigo/). 

2. Click on the Advanced Search button and select "Annotations" 
from the drop-down list. 

3. In the free-text filtering box on the left (Fig. 2a) type in 
GO:0051233, the GO identifier for the Cellular Component 
term "spindle midzone". 

4. Now open the Taxon menu on the left and click on the "more" 
button at the bottom. A pop-up menu will open; in the top 
filter box start typing "pombe". "Schizosaccharomyces 
pombe" should be the only option that appears. Click on the + 
next to the species name to add this to the filter. 

5. Now open the Annotation Extension menu on the left and 
click on the + button next to the term "mitotic anaphase" to 
add this to the filter. 

6. AmiGO 2 will display all of the annotations that use the 
"mitotic anaphase" term (or one of its child terms) in the 
annotation extension of a primary annotation to "spindle midzone" 
(or one of its child terms) (Fig. 2b). 
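
The "term or one of its child terms" behavior in the last step can be sketched with a toy in-memory filter. This is not how AmiGO 2 is implemented; the hierarchy fragment, the gene names (geneA to geneD), and the rows are hypothetical and exist only to show how matching over a term's ancestors lets a filter on a parent term also catch annotations made to its children.

```python
# Toy fragment of a term hierarchy: child label -> parent labels.
PARENTS = {
    "mitotic anaphase A": {"mitotic anaphase"},
    "mitotic anaphase B": {"mitotic anaphase"},
    "mitotic spindle midzone": {"spindle midzone"},
}


def closure(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# (gene product, direct GO term, extension term, taxon); invented rows.
rows = [
    ("geneA", "spindle midzone", "mitotic anaphase",
     "Schizosaccharomyces pombe"),
    ("geneB", "mitotic spindle midzone", "mitotic anaphase B",
     "Schizosaccharomyces pombe"),
    ("geneC", "spindle midzone", "mitotic metaphase",
     "Schizosaccharomyces pombe"),
    ("geneD", "spindle midzone", "mitotic anaphase", "Homo sapiens"),
]


def search(records, go_term, extension_term, taxon):
    """Keep rows whose GO term and extension term fall under the filters."""
    return [
        gene
        for gene, direct, extension, row_taxon in records
        if go_term in closure(direct)
        and extension_term in closure(extension)
        and row_taxon == taxon
    ]


hits = search(rows, "spindle midzone", "mitotic anaphase",
              "Schizosaccharomyces pombe")
```

Here `hits` contains geneA and geneB: geneB is annotated to more specific terms but is still returned, while geneC and geneD fail the extension and taxon filters respectively.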


[Fig. 2 screenshots: (a) free-text filter "GO:0051233", search pinned to "document category: annotation", with user filters "annotation extension class closure label: mitotic anaphase" and "taxon closure label: Schizosaccharomyces pombe"; (b) results table, total 7 annotations, all with evidence IDA and assigned by PomBase, each to "spindle midzone" or "mitotic spindle midzone" with the extension "exists during mitotic anaphase" (or mitotic anaphase A or B).] 

Fig. 2 Finding annotations in AmiGO 2 based on annotation extension data. (a) Filters applied in the AmiGO 2 
browser: GO:ID (GO:0051233 “spindle midzone”), annotation extension (mitotic anaphase), taxon 
(Schizosaccharomyces pombe). (b) Results of the search using the filters applied in (a). Six unique gene prod- 
ucts are located to the spindle midzone during mitotic anaphase 
6 Summary 


The Evidence and Conclusion Ontology (ECO): Supporting 
GO Annotations 


Marcus C. Chibucos, Deborah A. Siegele, James C. Hu, 
and Michelle Giglio 


Abstract 


The Evidence and Conclusion Ontology (ECO) is a community resource for describing the various types of 
evidence that are generated during the course of a scientific study and which are typically used to support 
assertions made by researchers. ECO describes multiple evidence types, including evidence resulting from 
experimental (i.e., wet lab) techniques, evidence arising from computational methods, statements made by 
authors (whether or not supported by evidence), and inferences drawn by researchers curating the literature. 
In addition to summarizing the evidence that supports a particular assertion, ECO also offers a means to 
document whether a computer or a human performed the process of making the annotation. Incorporating 
ECO into an annotation system makes it possible to leverage the structure of the ontology such that associ- 
ated data can be grouped hierarchically, users can select data associated with particular evidence types, and 
quality control pipelines can be optimized. Today, over 30 resources, including the Gene Ontology, use the 
Evidence and Conclusion Ontology to represent both evidence and how annotations are made. 


Key words Annotation, Biocuration, Conclusion, Confidence, Evidence, ECO, Experiment, 
Inference, Literature curation, Quality control 


1 Describing Evidence in Scientific Investigations 


1.1 Importance of Documenting Evidence 


Investigations in the life sciences routinely produce data from 
diverse methodologies using a wide range of tools and techniques. 
Such data generated during the course of a research project 
contribute to the pool of evidence that ultimately leads a scientific 
researcher to make a particular inference or draw a given 
conclusion. Ultimately, one goal of a scientist is to publish the 
conclusions that are drawn from a given research project in the scientific 
literature. Such conclusions typically take the form of assertions, 
i.e., statements that are believed to be true, about some aspect of 
biology. The process of biocuration seeks to extract from the 
literature the assertion that summarizes the research finding in addition 
to any relevant evidence in support of the finding. Ideally, both of 
these pieces of information will become integrated into a database 
in a structured way, so that they are readily accessible to the 
scientific community [1, 2] (Fig. 1). 

Recording evidence is essential because: (1) knowing what 
methodologies were used is central to the scientific method and 
can impact one's evaluation of the data or results; (2) associating 
evidence with data maintained electronically allows for selective 
data queries and retrieval from even the largest of databases; and 
(3) a structured representation of evidence makes automated 
quality control possible, which is absolutely essential to managing the 
ever-increasing number and size of biological databases. 


Fig. 1 Representing experimental methods and conclusions in a biological database. (a) An experiment is 
performed that generates data. (b) A researcher interprets methods and data, and draws conclusions that are 
published in a scientific journal and indexed in PubMed, for example. (c) A biocurator reads that paper, 
interprets the results presented therein, and makes an assertion. (d) The assertion is represented by associating an 
ontology term with the item being studied and stored along with other data, for example a protein sequence, 
at a biological database. (General summaries and related ECO classes are depicted along the bottom.) 


1.2 Multiple Types of Evidence and Ways of Associating Evidence with Assertions 


Evidence can be associated with assertions in many ways. Manual 
curation is a common approach [3, 4], outlined in Fig. 1. However, 
text mining or other computational methods can also be used to 
extract biological assertions from the scientific literature [5, 6], 
and assertions can also be made directly via bioinformatic 
techniques [7], e.g., the functional annotations assigned by a 
functional genome annotation pipeline. 

Numerous types of evidence form the bases for assertions that 
are made by researchers. Laboratory and field experiments are 
common sources of evidence, but computational (or in silico) analysis, 
whether executed by a person or an unsupervised machine, can also 
generate the evidence that is used to support assertions about 
biological function (Fig. 2). In addition, conclusions can be synthesized 
from investigator speculation or implied by known biology during 
the literature curation process. We can also consider provenance, a 
concept related to and sometimes conflated with evidence. A central 
goal of biological data repositories is to record in a structured 
fashion as much information as is known about the origins of a given 
accession. Yet sometimes an accession is imported from another 
database where the source for the annotation at that database is 
unclear. Even in this case it might be useful for the importing 
database to note the source of the statement/annotation along with a 
description of "imported information," indicating that nothing else 
is known about the evidence or provenance of that particular 
annotation. Thus there are numerous advantages to capturing scientific 
evidence and provenance, from describing specific methodologies to 
representing chains of custody. 


Fig. 2 Computational evidence and assertion. (a) A human or computer performs an analysis, for example 
comparing the sequence of a protein of unknown function to sequences at a database. A protein of known 
function is returned as a hit with corresponding alignment. (b) The alignment is analyzed and the protein 
sequences are deemed to share enough similarity to be considered homologs (related through common 
evolutionary descent). The query protein is assigned the same function as the database protein. (c) This 
information is stored at a sequence repository along with other data and metadata. (Text in white boxes depicts 
evidence and assertion methods used in this process.) 


2 The Evidence and Conclusion Ontology (ECO) 


2.1 The Argument for an Ontology of Evidence 


Due to the diversity of ways that exist to describe the multitude of 
scientific research methodologies, a means of representing evidence 
in a descriptive but structured way is required in order to maximize 
utility. The most efficient way to achieve this is to use an ontology, 
a controlled vocabulary where each term is well-defined and linked 
to other terms via defined relationships [8, 9]. In an ontological 
framework, evidence descriptions are represented not as free text, 
but rather as networked ontology classes where each child term is 
more specific (granular) than its parent [10]. High-level 
descriptions of types of evidence (such as "experimental evidence") are 
contained in more basal classes closest to the root class evidence. 
Increasingly specific terms that are grouped under the more general 
classes describe particular sub-types of evidence (such as 
"chromatography evidence"). The most specific terms, the so-called "leaf 
nodes" that contain no child terms, represent the most granular 
types of evidence generated during the course of a scientific 
investigation (for example "thin layer chromatography evidence"). The 
Evidence and Conclusion Ontology (ECO) (http://evidenceontology.org) 
was created to enable the structured description of 
experimental, computational, and other evidence types to support 
the assertions captured by scientific databases [11]. 


2.2 A Brief History of ECO 


As described throughout this book, the Gene Ontology (GO) uses 
terms organized into controlled vocabularies, and the relationships 
among these terms, to capture functional information about gene 
products. The need to systematically document evidence while 
curating annotations was recognized from the inception of the GO 
[12] and a set of "evidence codes" was created for this purpose [13]. 
In time it was realized that a better-structured and more 
comprehensive way to represent evidence was required. Thus, the set of 
initially created GO codes, along with terms created by two model 
organism databases, FlyBase [14] and The Arabidopsis Information 
Resource [15], evolved into the first version of ECO, the "Evidence 
Code Ontology". Since then, the use of ECO by other resources has 
continued to grow and the ontology has shifted its focus beyond 
GO in order to become a generalized ontology for the capture of 
evidence information. The official name of ECO is now the 
"Evidence and Conclusion Ontology". ECO is presently being 
developed to define and broaden its scope, normalize its content, 
and enhance interoperability with related resources. The GO remains 
an active user and participant in developing ECO. It is anticipated 
that soon the three-letter GO evidence codes to which so many are 
accustomed will be replaced by ECO term identifiers. 


[Fig. 3 diagram: the two root classes "evidence" and "assertion method"; under evidence, the path "sequence similarity evidence", "match to sequence model evidence", "match to InterPro member signature evidence", and two "match to InterPro member signature evidence used in ..." cross-product classes.] 

Fig. 3 Simplified representation of ECO, depicting general structure. ECO comprises two root classes along 
with their respective hierarchies, evidence (terms in black) and assertion method (terms in pink). A given type 
of evidence can be applied to (used in; dotted lines) automatic assertion or manual assertion, which 
necessitated the creation of ECO leaf nodes that are evidence × assertion method cross products. For simplicity, most 
ECO classes are not displayed in the figure, including, for example, five of eight direct subclasses of evidence 
or three of four types of similarity evidence and so on 


2.3 ECO Structure and Content 


Evidence terms descend from the root class "evidence", which is defined 
as "a type of information that is used to support an assertion" (Fig. 3). 
Most evidence terms are either experimental or computational in nature, 
e.g., "chromatography evidence" or "sequence similarity evidence", 
respectively (Fig. 3). However, ECO also comprises other types of 
evidence, such as "curator inference" and "author statement". 

In addition to describing evidence, ECO can also describe the 
means by which assertions are made, i.e., by a human or a machine. 
ECO calls this the "assertion method" and defines it as "a means 
by which a statement is made about an entity" (Figs. 1c and 2b). 
For example, whether a curator makes an annotation after reading 
about an experimental result in a scientific paper or after manually 
evaluating pairwise sequence alignment results, ECO can express 
that a manual curation method was used [3, 8]. Conversely, if an 
algorithm was used to assign a predicted function to a protein, 
ECO can express that an automated computational method was 
used. Thus "assertion method" forms a second root class with two 
branches: "manual assertion" and "automatic assertion" (Fig. 3). 

The current version of ECO comprises 630 terms that describe 
"evidence", "assertion method", or "evidence × assertion method" 
cross products. The ontology architecture of ECO was recently 
described in Chibucos et al. [11]. 


2.3.1 Extending ECO Beyond GO 

Recent development efforts of ECO have emphasized meeting the 
needs of a larger research community; see for example [11, 16], 
while still capturing the needed information for GO annotation, 
such as by adding comments and synonyms to a term. Many high- 
level ECO term definitions were written with explicit GO usage 
notes contained therein because ECO originated during early 
efforts of the GO. However, in order to increase overall usability of 
ECO by resources other than the GO, such verbiage has been 
removed, while retaining the essence of the term’s meaning and 
applicability to GO. As ECO has been developed, more and more 
granular terms have been created to represent increasingly com- 
plex laboratory, computational, and even inferential techniques. 

A discussion of ECO and GO would not be complete without 
mention of the GO evidence code IEA or “inferred from electronic 
annotation”. IEA is used to connote that an annotation was 
assigned through automated computational means, e.g., transfer- 
ring annotations from one protein to another. Because IEA 
describes how an annotation was assigned, rather than the specific 
type of supporting evidence, this term belongs as a subclass of 
“assertion method”. As described above, “assertion method” has 
two child terms, “manual assertion” and “automatic assertion”, 
with the latter being equivalent to IEA. Now it is possible to more 
accurately model evidence and the annotation process using ECO. 

Aside from rewording definitions and creating a second root class, 
the biggest conceptual modification of ECO is reflected by removal of 
the prefix “inferred from” from every term name (see the GO codes 
for a sense of how ECO terms were previously labeled). This was done 
because ECO considers not just inferences made during the curation 
process, per se, but other aspects of evidence documentation, such as 
what research methodologies were performed. 


3 Fundamentals of Evidence-Based GO Annotation 


Creating an association between a GO term and a gene product 
is the fundamental essence of the GO annotation process. 
Documenting the evidence for any given GO annotation is a critical 
component of this annotation process, and an annotation would be 
incomplete without the requisite evidence. In fact, evidence capture 
by the GO requires both a "GO evidence code" that describes in 
detail the type of work or analysis that was performed in support of 
the annotation, as well as a citation for the reference from which the 
evidence was derived. Curators go to great lengths to understand 
and properly apply the correct "evidence code" to a given 
annotation, and an online guide exists to explain the often-subtle 
distinctions between multiple related evidence types 
(http://geneontology.org/page/guide-go-evidence-codes) [4, 13]. 

The GO gene association file (GAF) format contains required 
columns for both evidence code and reference. Each GO evidence 
code maps directly to an ECO term. ECO maintains database cross 
references to the GO codes for easy mapping between systems. GO 
codes therefore represent a subset of the Evidence and Conclusion 
Ontology. Since independent development of ECO was undertaken, 
a number of new GO evidence codes have been created, e.g., IBA, 
IBD, IKR, IRD. Equivalent terms have been instantiated in ECO 
(Fig. 4a), which will continue to develop such terms for the GO. 


3.1 ECO Terms Versus GO Codes 

Although GO evidence codes are useful in themselves because they 
represent detailed descriptions of evidence types, they are 
maintained as a controlled vocabulary with a shallow hierarchical 
structure that lacks the advantages of a formal ontology like ECO. Further, 
the full set of terms within ECO provides the ability to capture 
more breadth and depth of evidence information than the GO 
evidence codes do. Additionally, as the field of biocuration evolves and 
the kinds of evidence being curated from the literature continue to 
grow both more detailed and nuanced, the number of two- and 
three-letter acronyms (e.g., IEA, IMP, EXP, and ISS) available for 
new terms will hit an upper limit (because the first letter of the 
three-letter GO codes often stands for "inferred", only the remaining 
two letters vary, giving just 26 × 26 = 676 possibilities). In fact, ECO 
developers have already received requests from different users to develop 
new, but unrelated, terms that had the same suggested three-letter 
acronyms. For all of these reasons, there are discussions underway 
about transitioning GO evidence storage to use ECO terms rather 
than GO evidence codes. Such a shift would combine the 
advantages of both systems and would still provide a mechanism for 
filtering evidence annotations by the previous codes if desired. If 
ECO terms were to be fully adopted by GO, the GAF format would 
change to require "ECO term" instead of "evidence code." Since 
most GO evidence codes have a one-to-one mapping to ECO terms 
(while the remainder, i.e., IEA, IGC, ISS, map, in conjunction with 
various GO standard references 
[http://purl.obolibrary.org/obo/eco/gaf-eco-mapping.txt], to specific 
ECO terms), GO data depositors could use a straightforward 
replacement based on the mappings. Other resources outside of GO 
have modeled their annotation 
capture systems on the GAF format. For example, the Ontology of 
Microbial Phenotypes [17] uses a modified version of the GO GAF, 
but employs ECO terms instead of GO evidence codes. The full use 
of ECO terms by the GO would enhance the integration of data 
derived from such diverse sources. 


Fig. 4 Applications of ECO to GO. (a) ECO evidence classes are hierarchical such that broader classes parent 
more granular ones; depicted here are evidence types that support a phylogenetic tree-based approach for 
generating manually reviewed, homology-based annotations. (b) When a protein is annotated based on 
sequence similarity to another annotated protein, the identity of that protein must be recorded in the 
annotation file along with the evidence. (c) Quality control assessment: Expression pattern evidence is only allowable 
for annotations to the GO Biological Process ontology. (d) Evidence is used to prevent circular annotations 
based solely on computational predictions. Chains of evidence are computationally evaluated to ensure that 
inferential annotations are linked to experimental evidence 
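
The mapping-based replacement of GO evidence codes described above could look roughly like the following sketch. The function name and table layout are hypothetical, and only a handful of default mappings are listed; they should be checked against the mapping file cited above before any real use. The reference-specific entry uses deliberately fake placeholder identifiers to show the mechanism only.

```python
from typing import Optional

# Default code -> ECO term for a few codes (verify against the mapping file).
DEFAULT_ECO = {
    "EXP": "ECO:0000269",
    "IDA": "ECO:0000314",
    "IPI": "ECO:0000353",
    "IMP": "ECO:0000315",
    "TAS": "ECO:0000304",
    "IEA": "ECO:0000501",
}

# (code, GO standard reference) -> more specific ECO term.
# Both identifiers below are placeholders, not real entries.
BY_REFERENCE = {
    ("IEA", "GO_REF:9999999"): "ECO:9999999",
}


def to_eco(code: str, reference: Optional[str] = None) -> str:
    """Translate a GO evidence code (plus optional reference) to an ECO ID."""
    if reference is not None and (code, reference) in BY_REFERENCE:
        return BY_REFERENCE[(code, reference)]
    try:
        return DEFAULT_ECO[code]
    except KeyError:
        raise KeyError(f"no ECO mapping known for evidence code {code!r}")


# A one-to-one code, and a code whose mapping depends on the reference.
ipi = to_eco("IPI")
iea_default = to_eco("IEA", "PMID:0000000")  # placeholder reference
iea_specific = to_eco("IEA", "GO_REF:9999999")
```

Looking up the reference-specific table first and falling back to the default reflects the text's distinction between codes with a one-to-one mapping and codes such as IEA that map in conjunction with a GO standard reference.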
4 Benefits of ECO and Applications for the GO 


There are currently over 365 million annotations in the GO 
repository linked to an evidence term, and these can be queried and 
maintained better with the help of an ontology by leveraging its hierarchical 
structure. One of the most direct applications for using an ontology of 
evidence is selective data query, i.e., to query a database for records 
associated with a particular evidence type. For example, searching for 
"thin layer chromatography evidence" (at present a leaf term with no 
subclasses) would return only the records associated with that 
evidence type and no others. But grouping annotations is also possible 
with this approach. A query for "chromatography evidence" will 
return data associated not only with "chromatography evidence" but 
also its more specific subtypes including "thin layer chromatography 
evidence" and "high performance liquid chromatography evidence". 

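
The difference between querying a leaf term and querying a broader term can be sketched as follows. The hierarchy is a simplified fragment that uses the term names from the text (intermediate ECO classes are omitted), and the record identifiers and function names are hypothetical.

```python
from collections import defaultdict

# Simplified child -> parent links; intermediate classes are omitted.
IS_A = {
    "thin layer chromatography evidence": "chromatography evidence",
    "high performance liquid chromatography evidence":
        "chromatography evidence",
    "chromatography evidence": "experimental evidence",
    "experimental evidence": "evidence",
}


def with_subtypes(term):
    """Return the term plus every more specific term beneath it."""
    children = defaultdict(set)
    for child, parent in IS_A.items():
        children[parent].add(child)
    found, stack = {term}, [term]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


# (record identifier, evidence type); invented records.
records = [
    ("record-1", "thin layer chromatography evidence"),
    ("record-2", "high performance liquid chromatography evidence"),
    ("record-3", "chromatography evidence"),
    ("record-4", "experimental evidence"),
]


def query(rows, term):
    """Select records whose evidence is the term or one of its subtypes."""
    wanted = with_subtypes(term)
    return [record_id for record_id, evidence in rows if evidence in wanted]


leaf_hits = query(records, "thin layer chromatography evidence")
group_hits = query(records, "chromatography evidence")
```

The leaf query returns only record-1, whereas the query on the parent term groups record-1, record-2, and record-3.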
But there are further benefits to be derived from an ontology of 
evidence beyond simple structured queries (Fig. 4). For example: 


1. To amplify the benefits of experimental knowledge that cura- 
tors capture, the GO Consortium is using a phylogenetic tree- 
based approach to generate manually reviewed, homology-based 
annotations for a range of species [18]. This phylogenetic 
annotation methodology necessitated a new set of evidence 
terms to capture the inference process (Fig. 4a). Currently 
over 150,000 annotations are associated with these new terms 
and the number continues to grow. 


2. The GO curatorial process uses evidence to support comput- 
able rules about the kinds of information that must be associ- 
ated with different evidence types. For example, one rule states 
that annotation of a protein based on alignment with another 
protein requires that the identity of the matching protein be 
captured, along with the evidence type “protein alignment evi- 
dence” (Fig. 4b). If such an evidence type were missing, this 
would flag the annotation for review. 


3. The GO uses evidence as a quality control mechanism for 
annotation consistency. For example, expression pattern evi- 
dence is restricted to annotations for terms from the “biologi- 
cal process” ontology. Annotations to terms from either of the 
other two GO ontologies (“molecular function” or “cellular 
component”) would be flagged as suspect (Fig. 4c). 


4. Evidence is used to prevent circular annotations based solely 
on computational predictions (Fig. 4d). Chains of evidence are 
computationally evaluated to ensure that inferential 
annotations are linked to experimental evidence. For example, 
annotations supported by "sequence alignment evidence" require 
the inclusion of a database identifier for the matching gene 
product that is itself linked to an annotation supported by 
experimental evidence. 


[Fig. 5 screenshots: AmiGO 2 text search for "proteolysis" and the resulting document selection page, listing hit counts per document category, e.g., Ontology (416), GO Models, Genes and gene products, Annotations, Protein families, and Noctua meta.] 

Fig. 5 AmiGO 2 query and results. (a) User has typed "proteolysis" into the search box. (b) Number of hits (right 
gray box) shown for each document category (blue boxed text). Clicking on "Annotations" will open a new page 
with more detailed results 
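
The rule-based checks in items 2 to 4 above can be sketched as follows. This is an illustration under simplifying assumptions, not the GO Consortium's actual pipeline: the record fields, the function names, and the protein identifiers P1 to P4 are hypothetical, and the chain check ignores which GO term is being supported.

```python
SIMILARITY = {"protein alignment evidence", "sequence alignment evidence"}


def check_annotation(annotation):
    """Return a list of rule violations for one annotation record."""
    problems = []
    if (annotation["evidence"] == "expression pattern evidence"
            and annotation["aspect"] != "biological process"):
        problems.append("expression pattern evidence outside biological process")
    if annotation["evidence"] in SIMILARITY and not annotation.get("with"):
        problems.append("similarity evidence without matching identifier")
    return problems


def traces_to_experiment(product, by_product, seen=None):
    """Follow 'with' links until experimental evidence or a cycle is found."""
    seen = set() if seen is None else seen
    if product in seen:
        return False  # circular chain of predictions
    seen.add(product)
    for annotation in by_product.get(product, []):
        if annotation["evidence"] == "experimental evidence":
            return True
        if annotation["evidence"] in SIMILARITY and annotation.get("with"):
            if traces_to_experiment(annotation["with"], by_product, seen):
                return True
    return False


# Invented records: P2 rests on P1's experiment; P3 and P4 cite each other.
by_product = {
    "P1": [{"evidence": "experimental evidence",
            "aspect": "molecular function"}],
    "P2": [{"evidence": "sequence alignment evidence",
            "aspect": "molecular function", "with": "P1"}],
    "P3": [{"evidence": "sequence alignment evidence",
            "aspect": "molecular function", "with": "P4"}],
    "P4": [{"evidence": "sequence alignment evidence",
            "aspect": "molecular function", "with": "P3"}],
}

suspect = check_annotation({"evidence": "expression pattern evidence",
                            "aspect": "cellular component"})
grounded = traces_to_experiment("P2", by_product)
circular = traces_to_experiment("P3", by_product)
```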


Yet another application of ECO for the GO has been realized in 
the UniProt-Gene Ontology Annotation (UniProt-GOA) project. 
Arguably, UniProt is the most comprehensive and best-curated 
protein database available to the research community. ECO terms have 
replaced the original UniProtKB [19] evidence types and are available 
in UniProtKB XML [11]. Novel ways of mapping and extending 
ontologies have been discussed with ECO and the GO Consortium 
to ensure appropriate development for UniProtKB annotation. The 
UniProt-GOA project provides >169 million manual and electronic 
evidence-based associations between GO terms and 26.5 million 
UniProtKB proteins covering >411,000 taxa [20]. Of these, manual 
annotation provides 1.4 million annotations to ~260,000 proteins. 
Since 2010, UniProt-GOA has supplied GO annotations in a Gene 
Product Association Data (GPAD) file format, which allows inclusion 
of ECO terms. Because ECO terms are cross referenced to 
corresponding GO codes, even if evidence for annotations was supplied to 
UniProt as GO codes, the GPAD file will display the appropriate 
equivalent ECO term. Thus, UniProt annotations can be grouped by 
leveraging the structure of ECO. 


[Fig. 6 screenshots: AmiGO 2 annotation results for "proteolysis", with (a) the filter panel and (b) annotation rows, e.g., IBA annotations contributed by GO Central with PANTHER identifiers in the "Evidence with" column.] 

Fig. 6 Annotation hits to a query search. (a) To the left of the search results, the user has an opportunity to click 
on filters. (b) To the right, each annotation row is shown for a given protein 


4.1 Exercise 


Once the reader has gained a basic understanding of ECO and its
connection to GO, we can perform the following simple exercise
that displays a faceted query using ECO in AmiGO 2
(http://amigo2.geneontology.org/amigo).


(80049) biological aspect of ancestor evidence used in manual assertion
(7587) sequence similarity evidence
(3906) sequence orthology evidence

Fig. 7 Selected ECO terms in use by the GO Consortium that are related to the
present query. The number of annotations supported by a given evidence type is
shown in parentheses


The user types “proteolysis” into the query box (Fig. 5a) and sees a
number of hits returned (Fig. 5b). Next, after clicking on “Annotations”
in the blue rectangle, the user sees all the annotation-related terms
that had hits to “proteolysis” (Fig. 6a, b), split into two parts here for
easier viewing. Clicking on “Evidence” in the filter box (Fig. 6a) will
expand it to display all constituent evidence types (Fig. 7). Clicking on
“traceable author statement used in manual assertion” will open a
subset of the results that match that more restrictive filter (Fig. 8).
The evidence filter box now says “Nothing to filter” (Fig. 9).


Fig. 8 Filtering on evidence. After filtering on “traceable author statement used in manual assertion”, only
annotations supported by that evidence type are displayed, shown as “TAS” in the “Evidence” column. Number
of annotations associated with that evidence type is shown at the top left


5 The Future of ECO


What else can an ontology of evidence do? One aspect of active 
exploration for ECO is the evaluation of confidence or quality of 
evidence. Work has begun [21] to develop a mechanism to incor- 
porate quality information into ECO or, as needed, to create a 
standalone system. It might one day be possible to use ECO to 
describe the quality of the evidence supporting an annotation in
addition to the type of evidence that supports the annotation. 

In summary, the Evidence and Conclusion Ontology can be 
used to support faceted queries of data, to establish computable 
rules about required types of evidence, as a quality control check 
for annotation consistency, and as a mechanism to prevent circular 
annotations rooted only in computational predictions. GO is 
already benefitting from these applications of ECO, and the future 
promises both additional new applications of ECO as well as 
advancements to current ones. 
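
As a minimal sketch of the faceted-query idea summarized above, the following Python snippet filters annotations by an evidence term and everything beneath it in an is-a hierarchy. The parent links and the sample annotation rows are assumptions made for illustration; they are not copied from ECO or from any AmiGO 2 result, and only the term names themselves appear in this chapter.

```python
# Toy is-a hierarchy over evidence terms (child -> parent).
# These links are illustrative assumptions, not the real ECO structure.
PARENT = {
    "sequence orthology evidence": "sequence similarity evidence",
    "sequence similarity evidence": "similarity evidence",
    "similarity evidence": "evidence",
    "traceable author statement used in manual assertion": "author statement",
    "author statement": "evidence",
}


def ancestors(term):
    """Return the term itself followed by every ancestor reachable via is-a."""
    chain = [term]
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def filter_by_evidence(annotations, facet):
    """Keep annotations whose evidence is the facet term or a descendant of it."""
    return [a for a in annotations if facet in ancestors(a["evidence"])]


# Hypothetical annotation rows (gene product, GO class, evidence term).
ANNOTATIONS = [
    {"product": "P1", "go": "proteolysis",
     "evidence": "sequence orthology evidence"},
    {"product": "P2", "go": "proteolysis",
     "evidence": "sequence similarity evidence"},
    {"product": "P3", "go": "proteolysis",
     "evidence": "traceable author statement used in manual assertion"},
]

similarity_hits = filter_by_evidence(ANNOTATIONS, "sequence similarity evidence")
print([a["product"] for a in similarity_hits])
```

Selecting a broader facet returns more rows, and selecting a leaf term returns only the annotations made with exactly that evidence type, which mirrors the narrowing behaviour described in the exercise.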
Complementary Sources of Protein Functional Information:
The Far Side of GO 


Nicholas Furnham 


Abstract 


The GO captures many aspects of functional annotations, but there are other, complementary
sources of protein function information. For example, enzyme functional annotations are described in a 
range of resources from the Enzyme Commission (E.C.) hierarchical classification to the Kyoto Encyclopedia 
of Genes and Genomes (KEGG) to the Catalytic Site Atlas amongst many others. This chapter describes 
some of the main resources available and how they can be used in conjunction with GO. 


Key words Function similarity, Protein domain functions, Enzyme Commission (EC), Pathway 
annotation 


1 Introduction 


The Gene Ontology (GO) offers experimental and computational 
biology researchers an accessible range of controlled vocabulary 
annotations to describe protein function. This allows detailed as 
well as large-scale analyses to be conducted. There is, however, a 
range of other sources of functional annotations, which in combination
with GO provide enhanced functional descriptions. Examples
of such complementary resources include the Enzyme Commission's
classification of enzyme reactions [1], the Kyoto Encyclopedia of 
Genes and Genomes (KEGG) [2], BRENDA [3], CSA [4], 
MACIE [5], MetaCyc database of enzyme and pathways [6], 
amongst many others. Most of these resources include GO terms 
within their own annotations or their definitions are included 
within the Gene Ontology. Mapping terms between resources 
offers enhanced descriptions and relationships between them not 
readily captured solely within GO. The Gene Ontology provides 
many of these mappings through its website
(http://geneontology.org/page/download-mappings), which are automatically
updated with various periodicities depending on how often the
corresponding resource is updated. This chapter describes some of
these complementary resources focusing mainly on enzymes. 


2 Annotating Enzymes 


Owing to more than 100 years of experimental biochemical data,
enzymes are one of the richest areas for complementary functional
annotation. Historically, naming conventions for enzymes have been
confused and haphazard, with several names being given to one
enzyme and one name being given to several enzymes. Often the
names bear little information as to the reaction the enzyme is
undertaking. This led to the development of the Enzyme Classification
(E.C.) system by the International Commission on Enzymes,
founded in 1956 by the International Union of Biochemistry [1].
The E.C. number is a hierarchical system consisting of four levels.
The first level has six divisions giving a broad description of the
overall chemical transformation (enzyme class): Oxidoreductases,
Transferases, Hydrolases, Lyases, Isomerases and Ligases. The next
two levels (subclass and sub-subclass) generally describe the reactive
species and the type of bond being acted upon. The meaning of
these numbers is class dependent. The final level is a serial number
for the overall reaction of that sub-subclass. The overall reactions
described are mass-balanced, as much as possible, though they are
not necessarily charge-balanced. Nor are they meant to represent
the equilibrium position or reaction direction: by convention, all
reactions within a given sub-subclass are written in the same
direction, even if their physiological directions differ. General
reactions, where the enzyme has broad specificity, are given as single
generic reactions, and alternative reactions with specific metabolites
are also given. Some reactions are incomplete, while others are
combinations of successive reactions [7].
Thus one E.C. number might have multiple reactions associated with
it, and one reaction might be described by several E.C. numbers
(see Fig. 1a).
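
To make the four-level layout concrete, the short Python sketch below splits an E.C. number into its levels and looks up the class name from the six divisions listed above. The function names `parse_ec` and `shared_levels` are hypothetical helpers introduced only for this illustration.

```python
# The six top-level enzyme classes named in the text, keyed by first digit.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def parse_ec(ec_number):
    """Split an E.C. number such as '1.1.1.287' into its four hierarchy levels."""
    parts = ec_number.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four levels, got {ec_number!r}")
    return {
        "class": EC_CLASSES[int(parts[0])],
        "subclass": ".".join(parts[:2]),
        "sub_subclass": ".".join(parts[:3]),
        "serial": parts[3],
    }


def shared_levels(ec_a, ec_b):
    """Count how many leading levels two E.C. numbers have in common (0 to 4)."""
    count = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a != b:
            break
        count += 1
    return count


print(parse_ec("1.1.1.287"))
print(shared_levels("2.1.2.9", "2.1.2.11"))
```

Because the meaning of the second and third levels depends on the class, the dictionary returned here only records the positions in the hierarchy; it does not attempt to interpret them.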
Currently there are 6510 E.C. numbers approved, with 5560 
of them in active use. Of these active annotations only 3924 (70%) 
have an equivalent GO term. A full list of E.C. to GO cross-references
can be found on the GO website
(http://geneontology.org/external2go/ec2go). There are a number of
reasons why a mapping between E.C. and GO cannot be made. Most likely is
that GO does not yet have a term that covers the E.C. term, e.g.
E.C. 1.1.1.287 (D-arabinitol dehydrogenase). An automatic pipeline
updates the cross-reference file after each GO release with any
new terms that are created. Other reasons are that an E.C. entry has
been transferred from one term to another, or that the E.C. number
has yet to be associated with a gene product (termed orphaned E.C. terms).
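
The cross-reference file can be read with a few lines of Python. The sketch below assumes each mapping line has the layout `EC:<number> > GO:<term name> ; GO:<id>` and that header lines start with `!`; this layout should be checked against the downloaded file before relying on it. The two sample lines are illustrative and are not an excerpt of the current release.

```python
import re
from collections import defaultdict

# Assumed line layout: "EC:<number> > GO:<term name> ; GO:<7-digit id>"
LINE_RE = re.compile(
    r"^EC:(?P<ec>\S+)\s+>\s+GO:(?P<name>.+?)\s+;\s+(?P<go>GO:\d{7})\s*$"
)


def parse_ec2go(lines):
    """Map each E.C. number to the list of GO identifiers it cross-references."""
    mapping = defaultdict(list)
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header comments and blank lines
        match = LINE_RE.match(line)
        if match:
            mapping[match.group("ec")].append(match.group("go"))
    return dict(mapping)


def unmapped(ec_numbers, mapping):
    """Return the E.C. numbers that have no GO equivalent in the mapping."""
    return [ec for ec in ec_numbers if ec not in mapping]


# Illustrative sample in the assumed layout (not the real file).
SAMPLE = """\
!illustrative sample
EC:1 > GO:oxidoreductase activity ; GO:0016491
EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD+) activity ; GO:0004022
""".splitlines()

ec_to_go = parse_ec2go(SAMPLE)
print(ec_to_go)
print(unmapped(["1.1.1.1", "1.1.1.287"], ec_to_go))
```

Keeping a list of GO identifiers per E.C. number, rather than a single value, leaves room for entries that map to more than one term.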
Fig. 1 (a) Examples of ambiguity in the E.C. classification, where one E.C. number can represent many reactions
and where many E.C. numbers are describing one reaction. (b) The two representations of the same
enzyme (phosphoinositide phospholipase C) in E.C. and GO, with the overall chemical reaction also shown. The
reaction diagram is highlighted to show sub-structures across the reaction used in the determination of bond
changes and reaction centers in EC-Blast


Additionally, there are “pseudo” E.C. terms created by UniProt
that describe an overall reaction derived from the literature but
have yet to be included in the E.C. These are easily identifiable as
they have a letter n in the fourth level of the hierarchy, e.g. 1.1.1.n5
(3-methylmalate dehydrogenase).
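
A simple pattern match is enough to separate these provisional identifiers from approved ones. The following sketch assumes the fourth level is the letter n followed by digits, as in the example above; the helper name `is_provisional_ec` is hypothetical.

```python
import re

# Three numeric levels followed by a fourth level of the form "n<digits>".
PROVISIONAL_RE = re.compile(r"^\d+\.\d+\.\d+\.n\d+$")


def is_provisional_ec(ec_number):
    """Return True for identifiers such as '1.1.1.n5', False for '1.1.1.287'."""
    return PROVISIONAL_RE.match(ec_number) is not None


print(is_provisional_ec("1.1.1.n5"))
print(is_provisional_ec("1.1.1.287"))
```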

Databases such as KEGG and BRENDA hold details of alter- 
native reactions and data relating to physiological function. Other 
resources hold more specific functional annotations such as the 
catalytic residues and how they function in the overall reactions, as 
cataloged by the Catalytic Site Atlas (CSA), or MACIE that anno- 
tates the steps in an enzyme's reaction, the order in which bonds 
are broken and formed, the role of cofactors and the function of 
protein residues at each step. To bridge the gap between these 
more chemical descriptors and the biological descriptors associated 
with a protein a new ontology, the Enzyme Mechanism Ontology 
(EMO), has been developed [4]. Though not directly linked to 
GO, EMO terms can be determined through links with GOA terms
of the UniProtKB record for a particular enzyme. 
3 Comparing Enzyme Annotations


Unlike GO, the E.C. number cannot be used to make automated 
quantitative comparisons between annotations. There are a number 
of measures of annotation similarity that can be made based on the 
GO ontological graph. The most basic similarity measure is based
on the length of the common path between two terms to the ontology
root. Because the depth of a term within the ontology is not
necessarily indicative of its specificity, this measure has been
enhanced with a measure of specificity termed information content
(IC). Further enhancements normalize the IC measure (Lin score) and
use semantic similarity (Wang score) [8, 9]. To overcome the deficiencies of E.C. as
a means to measure functional similarity and to capture detailed 
reaction information not encapsulated in GO, new methods have 
been developed. Efforts to compare reactions based on their overall 
reaction chemistry have met with only moderate success, limited by 
their reliance upon the consistency and reliability of the underlying 
reaction data and the ability of the algorithm used to process a 
diverse range of reactions. The latest method called EC-Blast [10] 
has proven more successful. It uses an atom-atom mapping approach 
to automatically assign bond changes and reaction centers (the 
atom and bond type in the immediate region of the metabolite 
where the bonds are broken/formed). This allows for the reaction 
to be described in a set of fingerprints that in composite can be used 
to compare reactions. Taking all available E.C. numbers and equiv- 
alent GO terms that can be compared to each other, the difference 
between the two ways of measuring functional similarity is shown in 
Fig. 2. Though many comparisons result in similar scores, a substantial
number diverge significantly. For example, when E.C. 2.1.2.9
is compared to E.C. 2.1.2.11 on the basis of bond order changes,
the similarity score calculated by EC-Blast is 0.22, whereas the
semantic similarity between the equivalent GO terms is 0.73. The
low similarity from EC-Blast encapsulates the differences in bonds
cleaved (two C-N bonds and two H-N bonds for E.C. 2.1.2.9, compared
to one C-C, one H-O and one C-H for E.C. 2.1.2.11) as well
as differences in stereochemistry changes and bond order rearrangements.
Thus, care needs to be taken in choosing the best measure
of functional similarity, a widely used technique in functional inference
(see Chap. 12 [26]).
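
The graph-based measures mentioned at the start of this section can be illustrated on a toy ontology. The Python sketch below uses a common textbook formulation, assumed here rather than taken from refs. 8 and 9: IC(t) = ln(N / n_t), where n_t counts annotations to t or any of its descendants, and the Lin score is 2 × IC(most informative common ancestor) / (IC(a) + IC(b)). The term graph and the annotation counts are hypothetical, and the Wang score is not implemented.

```python
import math

# Hypothetical toy ontology: child -> list of parents (a DAG, as in GO).
PARENTS = {
    "root": [],
    "catalytic activity": ["root"],
    "transferase activity": ["catalytic activity"],
    "hydrolase activity": ["catalytic activity"],
    "binding": ["root"],
}

# Hypothetical numbers of gene products annotated directly to each term.
DIRECT_COUNTS = {
    "root": 0,
    "catalytic activity": 10,
    "transferase activity": 30,
    "hydrolase activity": 20,
    "binding": 40,
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    found = {term}
    for parent in PARENTS[term]:
        found |= ancestors(parent)
    return found


def information_content(term):
    """IC(t) = ln(N / n_t); n_t counts annotations to t or any descendant."""
    total = sum(DIRECT_COUNTS.values())
    covered = sum(n for t, n in DIRECT_COUNTS.items() if term in ancestors(t))
    return math.log(total / covered)


def lin_similarity(term_a, term_b):
    """Lin score: 2 * IC(most informative common ancestor) / (IC(a) + IC(b))."""
    common = ancestors(term_a) & ancestors(term_b)
    ic_mica = max(information_content(t) for t in common)
    denominator = information_content(term_a) + information_content(term_b)
    return 2 * ic_mica / denominator if denominator else 1.0


print(lin_similarity("transferase activity", "hydrolase activity"))
print(lin_similarity("transferase activity", "binding"))
```

In this toy graph the two catalytic terms share an informative ancestor and receive a non-zero score, whereas a catalytic term and "binding" share only the root and score zero. Note that the score reflects position in the graph and annotation frequency only, not reaction chemistry, which is the gap that EC-Blast addresses.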


Fig. 2 Differences between GO and EC measures of functional similarity. A frequency
histogram showing the difference between the similarity scores of all-by-all
pairs of E.C. numbers calculated using EC-Blast bond similarity measure
and the equivalent GO term. GO similarity scores are calculated using the Wang
semantic similarity method. Not all E.C. numbers are used, as EC-Blast requires
fully balanced reactions and not all E.C. numbers have a GO term equivalent


4 Annotating Domains


One of the challenges of functional annotation is the granularity to
which an annotation can be attached. Most genomic annotations
are assigned to whole protein translations, i.e. the gene, but for
many functions it is a protein domain that can be considered the
functional unit. Of course functions are not solely confined to a
single domain and many functions are a product of multiple 
domains in combination. Many domains are combined with others 
in increasingly complex combinations and arrangements (see 
Fig. 3). This biological complexity complicates functional
annotation: one function may be assigned to the complete gene
product, while another is assigned to just one component domain
or to a multi-domain combination. There are a
number of domain and motif databases that provide functional
annotations, many of which are mapped to GO via the InterPro
[11] protein family database, which integrates predictive models
from a range of different protein family databases. One of the main
sequence-based domain family databases is PFam [12],
with the goal of creating a collection of functionally annotated
families that is as representative as possible of protein-sequence
space. PFam curators provide functional annotations, but in recent
releases these annotations have been outsourced to the community 
via the use of Wikipedia allowing anyone to freely edit and improve 
the content, with the original curator annotations maintained. By 
their very nature these annotations do not conform to a controlled 
vocabulary, but it is possible for PFam annotations to be mapped 
back to GO terms; this is provided by the InterPro group and is 
available via the GO website. 
Fig. 3 Biological complexity generated by multi-domain architectures. A force-directed graph of the multi-domain
architectures associated with a domain superfamily (“winged helix” repressor DNA binding domain).
The graph is centered on architecture containing just the single domain with nodes (red boxes) radiating from
this representing ever-increasing multi-domain architecture (shown to the right of the node). A key to the
domains in these multi-domain architectures is shown on the left identified by PFam codes (starting PF or PB)
or CATH codes. Functions are associated with the whole gene product as well as for single domains within the
multi-domain architecture. An interactive version of this graph can be found at
http://www.funtree.info/templates/showArch.php?cathcode=00001.00010.00010.00010&cathmethod=&cathcluster=&type=AS


The CATH [13] resource, which uses protein structures to
define domains both within known protein structures and sequences
where there is no structural information, uses the GO terms associated
with a sequence to define functionally coherent clusters
(termed FunFams) within the superfamily division of the classification.
The functional annotation provided is derived from the predominant
GO term found within the FunFam. These terms, though,
are assigned to the whole sequence and not the domain and therefore
may not directly relate to the specific function the domain is
participating in. In the SFLD [14], domains that are critical for
function are determined (often being used to define the superfamily),
thereby linking the functional annotation to a domain or
combination of domains within a multi-domain architecture (see
Chap. 9 [27]). SUPERFAMILY [15], a domain-centric resource
that uses an alternative structure-based domain classification called
SCOP, attempts to assign functional annotations specifically to a
domain. Using the GO semantic structure and the protein's multi-domain
architecture, domain-centric functional annotations are statistically
inferred based on the assumption that if a GO term is
annotated to proteins that contain a shared domain then that term
should also confer functional indicators for that domain. The
SUPERFAMILY developers have generated a reduced version of
GO for annotating domains, which forms part of a structural domain
functional ontology (SDFO) [16]. The approach of linking ontological
terms to a domain can be generalized to other ontologies,
most notably for phenotypic annotations. For example,
SUPERFAMILY integrates the Mammalian Phenotype Ontology
(MPO) [17] from Mouse Genome Informatics (MGI) and the
Human Phenotype Ontology (HPO) from the OMIM [18]
resource.
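
One generic way to test whether a GO term is associated with a domain is a one-sided hypergeometric test on protein-level annotations. The Python sketch below illustrates that idea only; it is an assumption for the sake of example and is not presented as the procedure used by SUPERFAMILY, which is not detailed here. The protein, domain, and term labels are hypothetical.

```python
from math import comb


def hypergeom_tail(population, successes, draws, observed):
    """P(X >= observed) when drawing `draws` items without replacement."""
    upper = min(successes, draws)
    total = comb(population, draws)
    hits = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return hits / total


def domain_term_pvalue(protein_domains, protein_terms, domain, term):
    """Probability of seeing at least the observed overlap by chance."""
    proteins = set(protein_domains)
    with_domain = {p for p in proteins if domain in protein_domains[p]}
    with_term = {p for p in proteins if term in protein_terms.get(p, set())}
    overlap = len(with_domain & with_term)
    return hypergeom_tail(len(proteins), len(with_term), len(with_domain), overlap)


# Hypothetical data: eight proteins, domain "D" in four of them,
# GO term "T" annotated to four, three of which carry the domain.
protein_domains = {
    "p1": {"D"}, "p2": {"D"}, "p3": {"D", "E"}, "p4": {"D"},
    "p5": {"E"}, "p6": {"E"}, "p7": {"E"}, "p8": {"E"},
}
protein_terms = {"p1": {"T"}, "p2": {"T"}, "p3": {"T"}, "p5": {"T"}}

p_value = domain_term_pvalue(protein_domains, protein_terms, "D", "T")
print(p_value)
```

With so few proteins the association is far from significant, which illustrates why domain-centric inference depends on large numbers of annotated sequences; in practice a correction for testing many domain and term pairs would also be needed.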


5 Pathways and Interactions 


Individual components of a pathway or groups of interacting pro- 
teins are described by the molecular function set of GO terms, 
while the pathways and interactions these components participate 
in are captured in the biological process GO terms. These provide 
overall descriptions of a biological process, such as signal transduc- 
tion, or more specific terms such as thiamine metabolism. GO does 
not try to represent the dynamics or dependencies that are equiva- 
lent to a signal or metabolic pathway, though the GO consortium 
has recognized the importance of contextualizing gene product 
annotations and has begun to add some directional information
(see Chap. 17 [28]). To be able to put the components into the 
context of a metabolic pathway for example, the use of specialist 
databases such as KEGG, BioCarta, MetaCyc, Pathway Interaction 
Database [19] and Reactome [20] is required (see Table 1). These 
provide curated and computationally derived descriptions of over- 
all topologies and interactions, often displayed as pathway dia- 
grams and maps. Many of these data resources are able to map 
terms back to GO. IntAct [21], which is a molecular interaction
database curated from the literature or by data depositors, scores
and filters interaction evidence to generate a high-confidence subset
of molecular interactions that are exported to GO.
Combinations of GO terms and pathway/interactions data- 
bases can be used in the analysis of proteomics data for functional 
annotation. This can be achieved either using methods for GO
enrichment analysis and subsequently linking the results to external
pathway resources [22] or by dynamically constructing the
pathway/interaction network based on the gene list of interest to
create a functionally organized GO/pathway term network [23].
Additionally, participation in common biological processes
or sharing of molecular functions is predictive of interactions [24].
Many methods that combine semantic similarity and machine
learning techniques have been developed to use GO to predict
protein-protein interactions (PPIs) (see ref. 25 and references therein).


6 Conclusions


The Gene Ontology provides a rich set of ontological terms to 
describe many aspects of a protein’s function. Many of these terms 
have equivalents in more specialist resources that, like the Gene
Ontology, collate primary data derived from the literature. Often
these resources include functional annotations that are not directly 
captured in GO or allow for annotations to be collated around a 
different functional unit, as in the case of protein domain centered 
functional annotations. Other types of functional descriptors such 
as the dependencies in metabolic pathways and protein-protein 
interactions are not explicitly captured in GO (though this is cur- 
rently being addressed through GO annotation extensions), but in 
combination with other resources can be used to provide and 
enhance functional annotation of proteins. 
Integrating Bio-ontologies and Controlled Clinical
Terminologies: From Base Pairs to Bedside Phenotypes 


Spiros C. Denaxas 


Abstract 


Electronic Health Records (EHR) are inherently complex and diverse and cannot be readily integrated and 
analyzed. Analogous to the Gene Ontology, controlled clinical terminologies were created to facilitate the 
standardization and integration of medical concepts and knowledge and enable their subsequent use for 
translational research, official statistics and medical billing. This chapter will introduce several of the main 
controlled clinical terminologies used to record diagnoses, surgical procedures, laboratory results and medi- 
cations. The discovery of novel therapeutic agents and treatments for rare or common diseases increasingly 
requires the integration of genotypic and phenotypic knowledge across different biomedical data sources. 
Mechanisms that facilitate this linkage, such as the Human Phenotype Ontology, are also discussed. 


Key words Electronic health records, Clinical terminologies, Phenotypes 


1 Introduction 


We are arguably entering the era of data-driven, personalized med- 
icine, where electronic health records are considered the transfor- 
mational force for measuring and improving the quality of clinical 
care and accelerating the pace of biomedical research [1, 2]. 
Electronic Health Record (EHR) data, alternatively referred to as 
Electronic Medical Record (EMR) data, are broadly defined as 
electronic data that are generated, captured and collected as part of 
routine clinical care across primary, secondary, and tertiary health 
care settings. EHR data can be structured (i.e., recorded using 
clinical terminologies), semi-structured (e.g., laboratory test 
results), or unstructured (e.g., free text). EHR data present mul- 
tiple opportunities that have the potential to transform medical 
practice and research across all stages of translation [3-6]. 

Health care is an intrinsically multidisciplinary process and the 
care of patients, even within a single clinical specialty, intimately 
involves clinicians from a diverse set of other specialties (e.g., physi- 
cians, surgeons, radiologists, pharmacologists). Patient interactions 
often occur within distinct health care settings: some diseases are
almost exclusively managed in primary care while acute manifesta- 
tions are usually treated in secondary care. For chronic conditions, 
such as cardiovascular diseases, patients may have multiple interac- 
tions within primary and secondary care, and undergo assessments 
and diagnostic tests across both settings over long periods of time. 
The amount of EHR data being digitally generated and collected
is thus vast and rapidly expanding, but the data lack a common structure to
facilitate their use, both for care across clinical settings and for
research, auditing, and other administrative purposes.

The purpose of this chapter is to provide a brief introduction 
to clinical terminologies for capturing and representing different 
aspects of clinical care in electronic health records. Firstly, contem- 
porary terminologies for recording diagnoses, surgical procedures, 
lab measurements, and medication are described. Secondly, the 
main applications and challenges of using clinical terminologies are 
set out. Lastly, a potential pathway for integrating clinical termi- 
nologies with biological ontologies is illustrated through a case 
study in breast cancer. 


2 Controlled Clinical Terminologies


Similar to bio-ontologies, such as the Gene Ontology [7, 8], controlled
clinical terminologies (Table 1) were created to facilitate
the systematic capture, curation, and description of health care-related
concepts encountered during clinical care [9]. These can
include but are not limited to diagnoses, symptoms, anatomical
terms of location, prescribed medications, medical tests, surgical
procedures, and laboratory measurements. Clinical terminologies
are considered the conceptual core of clinical information systems
and an essential tool for facilitating clinical data integration and
reuse amongst disparate data sources. Initiatives such as the Open
Biomedical Ontologies Consortium (OBO) [10] were founded to
coordinate their evolution and alignment and provide a set of
guidelines for creating and maintaining them with the aim of establishing
an ecosystem of interoperable entities.

Several systematic literature reviews provide in-depth detail on
their different aspects and characteristics [11-16]. A brief description
of some key terminologies is provided below.


2.1 Diagnoses


SNOMED-Clinical Terms (SNOMED-CT) [17, 18] contains representations
for over 300,000 health care-related concepts and is
designed to capture and represent patient data for clinical care. It
consists of four primary components that define the structure of the
recorded information: concepts, descriptions, relationships and reference
sets. Concepts are the basic unit of describing health care-related
information and are uniquely identified, e.g., the Myocardial
Infarction concept (id 22298006). All concepts have a unique
Fully Specified Name, a list of Preferred Terms (e.g., Myocardial
Infarction), and Synonyms (e.g., Heart attack, Cardiac infarction)
defined. Concepts are organized into an acyclic hierarchy of is-a
relationships that enables multiple inheritance, i.e., concepts can have
multiple parent concepts. For example, Myocardial Infarction (id
22298006) is a subclass of the concepts Necrosis of anatomical
site (id 609410002), Ischaemic heart disease (id 414545008), and
Myocardial disease (id 57809008). SNOMED-CT contains terms
for describing clinical findings, symptoms, diagnoses, procedures,
medication, devices and anatomical body structures. It provides a
compositional syntax which allows multiple ontology terms to be
combined in order to build composite terms to represent complex
medical concepts, a process known as post-coordination. Significant
variation exists internationally with regard to SNOMED-CT adoption
and implementation [19] and its use for research or routine clinical
care. In the UK National Health Service (NHS), SNOMED-CT
has been designated to become the standard clinical terminology to
be used across the entire health care system by 2020.

The International Statistical Classification of Diseases and
Related Health Problems (ICD) is a statistical classification system
maintained by the World Health Organization [20]. ICD encapsulates
concepts for classifying diseases, signs and symptoms, abnormal
investigation findings, complaints, interactions with the health care
system, social circumstances, and external causes of injury or disease.
It maps health conditions to corresponding generic categories
together with specific variations, assigning for these a designated
alphanumeric code, up to six characters long. Major categories are
designed to include a set of similar diseases (e.g., ICD chapter “I”
encapsulates all diseases of the circulatory system). It is currently the
most widely used statistical classification system in the world with
many countries developing their own extensions and modifications
tailored to their local health care system (e.g., ICD-9-CM used in
the USA [21]). The primary use case of ICD is to abstract EHR data
by assigning unique codes to diagnoses and procedures. This process
is known as clinical coding, and is performed manually or algorithmically
by specialist staff according to a prespecified protocol. Coded
data are then utilized for research [22], official statistics [23], medical
billing, and health care resource planning.


Table 1
Common clinical terminologies, classification systems, and ontologies used in electronic health records

Terminology    Information

CPT            Name: Current Procedural Terminology
               Context: surgical procedures
               Website: http://www.ama-assn.org/go/cpt

DSM-5          Name: Diagnostic and Statistical Manual of Mental Disorders, version 5
               Context: mental health diagnoses
               Website: http://www.dsm5.org/

ICD-10         Name: International Statistical Classification of Diseases and Related Health Problems, 10th revision
               Context: diagnoses
               Website: http://www.who.int/classifications/icd/en/

LOINC          Name: Logical Observation Identifiers Names and Codes
               Context: laboratory measurements
               Website: https://loinc.org/

MedDRA         Name: Medical Dictionary for Regulatory Activities
               Context: biopharmaceutical regulation
               Website: http://www.meddra.org/

MeSH           Name: Medical Subject Headings
               Context: life sciences literature indexing
               Website: https://www.nlm.nih.gov/mesh/

NCIT           Name: National Cancer Institute Thesaurus
               Context: biomedical concepts related to cancer
               Website: http://ncit.nci.nih.gov/

OPCS           Name: OPCS Classification of Interventions and Procedures
               Context: surgical procedures
               Website: http://systems.hscic.gov.uk/data/clinicalcoding/codingstandards/opcs4

Read           Name: Read Codes, Clinical Terms
               Context: all health care related concepts
               Website: http://systems.hscic.gov.uk/data/uktc/readcodes

RxNorm         Name: RxNorm
               Context: US clinical drugs
               Website: http://www.nlm.nih.gov/research/umls/rxnorm/

SNOMED-CT      Name: Systematized Nomenclature of Medicine-Clinical Terms
               Context: all health care related concepts
               Website: http://www.ihtsdo.org/snomed-ct

UMLS           Name: Unified Medical Language System
               Context: clinical terminology mappings
               Website: http://www.nlm.nih.gov/research/umls/


2.2 Procedures


Clinical terminologies are used for describing surgical procedures,
interventions, and investigations that patients undergo in hospitals,
during inpatient and outpatient interactions. In the USA, the
American Medical Association maintains the Current Procedural
Terminology [24] (CPT) and in the UK, the OPCS Classification
of Interventions and Procedures version 4 (OPCS-4) [25] is used
by the National Health Service. Both terminologies are used to
convey information with regard to procedures to physicians and
clinical coders and are combined with diagnosis codes during the
medical billing process.


2.3 Laboratory Measurements


Logical Observation Identifiers Names and Codes (LOINC) [26-28] 
is maintained by the Regenstrief Institute and used for describing 
medical laboratory observations. LOINC facilitates the exchange of 
information with regards to laboratory tests and results between
health care providers, laboratories and public health agencies. 
LOINC terms correspond to a single test, panel, observation, or 
measurement and are uniquely identified by a numeric code. Terms 
are formed of six parts: component (what is being measured), prop- 
erty (characteristics of what is being measured), time (measurement 
temporal information), system (observation context or specimen 
type), scale (scale of measure), and method (procedure used to 
obtain the measure). 
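
The six-part structure described above can be sketched as a small record type. This is a minimal illustration only: the class name `LoincTerm`, its field names, and the colon-joined rendering are assumptions made for the example, and the sample values should be checked against the LOINC database before any real use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoincTerm:
    """One term: a code plus the six parts described in the text."""
    code: str                     # unique identifier of the term
    component: str                # what is being measured
    prop: str                     # characteristic of what is being measured
    time: str                     # temporal aspect of the measurement
    system: str                   # observation context or specimen type
    scale: str                    # scale of measure
    method: Optional[str] = None  # procedure used; may be left unspecified

    def six_part_name(self) -> str:
        """Join the six parts with colons, leaving a missing method blank."""
        parts = [self.component, self.prop, self.time,
                 self.system, self.scale, self.method or ""]
        return ":".join(parts)


# Illustrative values only; verify any real code against the LOINC database.
glucose = LoincTerm(code="2345-7", component="Glucose", prop="MCnc",
                    time="Pt", system="Ser/Plas", scale="Qn")
print(glucose.code, glucose.six_part_name())
```

Keeping the six parts as separate fields, rather than as one string, makes it straightforward to group tests by specimen type or by scale when results from several laboratories are pooled.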


RxNorm [29] is a US-specific terminology developed by the National Library 
of Medicine for describing information about clinical drugs (defined 
as pharmaceutical products taken by patients with a therapeutic or 
diagnostic intent). It provides normalized names for all clinical drugs 
and links information about their active ingredient(s), strengths, 
form, and branded versions. RxNorm is widely used for recording 
drug information in patient health records, exchanging information 
between health care providers [30], personal medication records 
[31], and medication-related clinical decision support [32] and con- 
tains cross-references to other commonly used drug vocabularies. 


3 Uses of Clinical Terminologies 


3.1 Opportunities 


While clinical terminologies are primarily used for the purposes of 
clinical data standardization and integration, the provision of a sys- 
tematic and common language for describing health care concepts 
enables the subsequent use of EHR data for a diverse set of pur- 
poses, such as clinical research, auditing and billing. Adoption of 
clinical terminologies worldwide varies across health care settings 
and by purpose but diagnostic and procedural classification sys- 
tems are primarily used for medical billing purposes. This section 
will briefly describe the opportunities and challenges of using EHR 
data and clinical terminologies. 


EHR data are increasingly being linked and used for translational 
research [33] as they offer larger sample sizes at a higher clinical 
resolution [34]. A primary use-case of linked EHR data is to accurately 
extract phenotypic information (i.e., disease status), a process 
known as phenotyping [35]. Identifying cohorts of patients that 
share a common characteristic (e.g., have been diagnosed with 
hypertension or have abnormally high blood glucose measure- 
ments) enables researchers to use EHR data to perform large-scale 
clinical research studies at a lower cost compared to traditional 
bespoke investigator-led studies. EHR data have been used to 
examine disease aetiology in relation to clinical risk factors [36, 37] 
or genotypic information [38, 39], develop disease prognosis mod- 
els [40], perform health outcome comparisons between countries 
[41], and facilitate pragmatic clinical trials [24]. Clinical terminologies 
are heavily used by deterministic rule-based algorithms curated 
by experts for identifying and constructing patient cohorts from 
raw EHR data but data-driven methodologies are increasingly 
being utilized [42]. Comprehensive reviews provide additional 
information on the use of clinical terminologies for other purposes 
such as annotating and accessing medical knowledge sources, data 
integration, semantic interoperability, data aggregation, and clinical 
decision support systems [43-46]. 
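
A deterministic, rule-based phenotyping algorithm of the kind described above might look like the following sketch. The code list, the glucose threshold, the record layout, and all function names are assumptions chosen for illustration; a real phenotype definition would be curated and validated by clinical experts.

```python
from typing import Dict, List, Set

# Hypothetical phenotype definition: code list and threshold are illustrative.
HYPERTENSION_CODES: Set[str] = {"I10"}   # ICD-10 essential hypertension
GLUCOSE_THRESHOLD_MMOL_L = 7.0           # assumed cut-off, not a guideline


def has_hypertension(diagnosis_codes: List[str]) -> bool:
    """Rule 1: any recorded diagnosis code appears in the phenotype code list."""
    return any(code in HYPERTENSION_CODES for code in diagnosis_codes)


def has_high_glucose(glucose_values: List[float]) -> bool:
    """Rule 2: at least one glucose measurement exceeds the threshold."""
    return any(value > GLUCOSE_THRESHOLD_MMOL_L for value in glucose_values)


def build_cohort(patients: Dict[str, dict]) -> List[str]:
    """Return the sorted IDs of patients who satisfy either rule."""
    return sorted(
        pid for pid, record in patients.items()
        if has_hypertension(record["diagnoses"])
        or has_high_glucose(record["glucose"])
    )


# Toy records, invented for this example.
patients = {
    "p1": {"diagnoses": ["I10"], "glucose": [5.1]},
    "p2": {"diagnoses": ["J45"], "glucose": [5.4, 5.0]},
    "p3": {"diagnoses": [], "glucose": [8.2]},
}
print(build_cohort(patients))
```

Because every rule is an explicit lookup against a controlled terminology, the cohort definition can be published and re-applied to another data source that uses the same codes.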


3.2 Challenges 


Merging EHR data across sources becomes challenging due to the 
differences in the manner in which data are recorded. Each health 
care setting generates and records data for a particular purpose 
using the clinical terminology that is optimal in that specific con- 
text. For example, information in primary care can be recorded 
using SNOMED-CT whereas hospital morbidities would be 
recorded using ICD-10. This mismatch between the clinical termi- 
nologies used to record information leads to significant challenges 
as information is recorded at varying levels of granularity across 
sources. Semantic mapping systems, such as the Unified Medical 
Language System [47] (UMLS), can provide further details on the 
relationship between terms in each clinical terminology and facili- 
tate the translation or integration of information across sources. 
However, direct one-to-one mappings might not always exist 
between terminologies leading to information loss due to insuffi- 
cient resolution or conflicts between two sources where multiple 
potential mappings exist. These issues and their severity vary by 
clinical speciality and context but often require a set of rules to be 
created by users and manually applied in order to resolve them 
before the data can be used for research purposes. In cases of 
incomplete mappings, synonyms or adjacent terms in the clinical 
terminology might be used as a replacement term but that is 
assessed on a case-by-case basis. 
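
The mapping problems described above can be made concrete with a short sketch that sorts source codes by the number of candidate targets they have. The function name, the report keys, and the placeholder codes are assumptions; the code pairs would in practice come from a mapping resource such as the UMLS.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def classify_mappings(
    source_codes: Iterable[str],
    pairs: Iterable[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Group source codes by how many target codes they map to."""
    targets = defaultdict(set)
    for source, target in pairs:
        targets[source].add(target)

    report: Dict[str, List[str]] = {
        "single_target": [], "ambiguous": [], "unmapped": []}
    for code in source_codes:
        n = len(targets.get(code, ()))
        if n == 1:
            report["single_target"].append(code)
        elif n > 1:
            report["ambiguous"].append(code)   # needs a user-defined rule
        else:
            report["unmapped"].append(code)    # candidate for a proxy term
    return report


# Placeholder codes, invented for this example.
report = classify_mappings(
    ["S1", "S2", "S3"],
    [("S1", "T1"), ("S2", "T2"), ("S2", "T3")],
)
print(report)
```

Codes reported as ambiguous or unmapped are the ones that call for the manually curated rules and case-by-case replacement terms discussed in the text.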


4 Integrating Biological and Clinical Data 


A key challenge in genomics is to understand and elucidate the phenotypic 
consequences of variation observed at the genotypic level. 
Even among Mendelian diseases, the association between genotype 
and phenotype is often complex. With the advent of next-genera- 
tion sequencing methods, the focus is now shifting from generating 
genomic sequence data to efficiently interpreting them. 

From a clinical care perspective, diseases presented by patients 
can be phenotypically distinct and associated with a specific set of 
treatments, symptoms, investigative procedures and management 
strategies. From a molecular scientist's perspective, however, it might 
be appropriate to group and analyze diseases that share a common 
biological pathway as a single entity in order to discover similarities 
in the way they manifest in different patient groups. Both of these 
viewpoints are valid, but as a direct consequence, data describing 
phenotypic and molecular properties are recorded in a different, and 
often incompatible, manner [48]. The problem is exacerbated in 
rare diseases where researchers are required to create larger cohorts 
of patients by pooling data across research consortia in order to 
increase the sample sizes and obtain accurate estimates of risk. 

Increasing amounts of molecular function knowledge are being 
recorded in a hierarchical manner, using bio-ontologies such as the 
GO, which offer a rigid way to represent knowledge in a machine- 
readable manner, interoperable between different data sources and 
annotated [11]. Scientists aim to link and integrate this with pheno- 
typic information in order to elucidate the genotype-phenotype 
relationship and facilitate the discovery of novel therapeutic agents 
and treatments for common or rare disorders. Ontologies such as 
the Human Phenotype Ontology (HPO) [49, 50] and the Disease 
Ontology [51, 52] were created to provide streamlined disease defi- 
nitions by systematically combining the diverse and heterogeneous 
knowledge contained within clinical terminologies and other anno- 
tation sources under a single framework. These tools aim to provide 
researchers with a rich resource that semantically links diverse dis- 
ease definitions from clinical terminologies and enables the linking 
of phenotypic, genotypic and genetic information of a disease. 


The HPO is a structured, curated ontology describing phenotypic 
abnormalities and the relationships between them. The HPO aims to 
act as scaffolding for enabling the interoperability between molecular 
biology and human disease by providing a centralized resource for 
integrating genotypic and phenotypic data across biomedical sources. 
The HPO enables the computational analysis of human (and model 
organism) phenotypes against the background biological and molecu- 
lar knowledge incorporated in biological ontologies such as the GO. 

The HPO is organized as three independent sub-ontologies that 
cover different domains with the largest one being the one describ- 
ing phenotypic abnormalities. The other two sub-ontologies describe 
the mode of inheritance and the onset and clinical course of the 
abnormalities. The primary focus of the HPO is not to capture dis- 
eases but rather the phenotypic abnormalities that are associated 
with them. Each HPO term describes a phenotypic abnormality 
(e.g., Primary congenital glaucoma) and is assigned a unique persis- 
tent identifier (e.g., HP:0001087). HPO terms are related to parent 
terms by “is a” relationships and terms can have multiple parent 
terms. The HPO is not primarily designed to capture and document 
quantitative information (e.g., systolic blood pressure, body mass 
index) but does provide qualitative descriptions of excess or reduc- 
tion in quantity leading to a phenotypic abnormality (e.g., markedly 
reduced T cell function). 
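
Because a term can have several parents, the “is a” links form a directed acyclic graph rather than a tree. The sketch below collects all ancestors of a term in such a graph; the identifiers are placeholders rather than real HPO terms, and the dictionary layout is an assumption made for the example.

```python
from typing import Dict, List, Set

# Toy "is a" graph: child -> list of parents. Identifiers are placeholders.
IS_A: Dict[str, List[str]] = {
    "T:4": ["T:2", "T:3"],   # a term with two parents
    "T:2": ["T:1"],
    "T:3": ["T:1"],
    "T:1": [],               # root of the sub-ontology
}


def ancestors(term: str, is_a: Dict[str, List[str]]) -> Set[str]:
    """Collect every term reachable by following "is a" links upward."""
    seen: Set[str] = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:          # shared ancestors are visited once
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


print(sorted(ancestors("T:4", IS_A)))
```

A patient annotated with the most specific term can therefore be retrieved by a query for any of its ancestors, which is what makes annotations recorded at different levels of detail comparable.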

Interoperability between molecular and phenotypic data and 
research areas is accomplished through a comprehensive set of term 
annotations. The majority of HPO terms contain a reference to the 
Unified Medical Language System [47], enabling the mapping of 
terms between controlled clinical terminologies and other sources 
in the UMLS Metathesaurus. Additionally, HPO terms contain 
annotations that provide pointers to specific diseases or genes created 
in other external knowledge sources such as the Online Mendelian 
Inheritance in Man (OMIM) database (http://omim.org/), 
DECIPHER (https://decipher.sanger.ac.uk/), and Orphanet 
(http://www.orpha.net/). HPO annotations have a number of 
metadata fields associated with them for further specifying onset and 
frequency and for quantifying modifier effects. Annotation evidence 
codes, analogous to GO Evidence Codes, describe the manner in 
which a particular annotation was assigned to a term (e.g., inferred 
by text mining, traceable author statement, inferred from electronic 
annotation, public clinical study). 


4.2 From Base Pairs to Bedside Phenotypes: Breast Cancer Case Study 

Using malignant neoplasms of the breast as a hypothetical case 
study, this section presents a potential pathway of linking biologi- 
cal knowledge on genotypic variation and molecular functions to 
clinical phenotypes encountered within the health care system. 
Drilling down from the right-hand side of clinical phenotypes 
to the left-hand side of genotypic variation, Figure 1 illustrates 
details of all potential sources and the annotation mechanisms used 
within each source to capture and record information. 


Fig. 1 Along one potential path from genomic variation to genotypic information, transcripts and phenotypic 
information observed in clinical care there are multiple annotation mechanisms that are being utilized to 
record information in a structured way and enable the machine-driven interoperability between different 
platforms. [Figure not reproduced. Panels run from genomic variation through genotype and phenotype to 
clinical phenotype. Annotation mechanisms shown: HUGO/HGNC, RefSeq, GeneLoc, GeneId, GO, HPO, DO, 
NCIT, SNOMED-CT, RxNorm, CPT, OPCS. Resources shown: dbSNP, Entrez, Ensembl, UniProt, OMIM, 
Orphanet, UMLS.] 


Genotypic information: HPO annotations provide a cross-link to the 
Online Mendelian Inheritance in Man (OMIM) Breast Cancer, 
Familial phenotype entity (OMIM #114480; URL www.omim.org/entry/114480). 
OMIM provides curated lists of disease phenotypes and genes associated 
with that phenotype, in this case for example the BRCA2 gene entry 
(OMIM *600185; www.omim.org/entry/600185). Additionally, entries 
provide cross-links with Entrez [53] (Gene ID 675; URL 
http://www.ncbi.nlm.nih.gov/gene/675) and Ensembl [54] (ENSG00000139618; URL 
http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000139618;r=13:32315474-32400266). 
Breast Cancer 2, early onset (BRCA2) is a protein-coding gene and belongs to the 
Fanconi anemia, complementation group (FANC) family of genes. 


Genotypic variation: The NCBI dbSNP (http://www.ncbi.nlm.nih.gov/SNP/) 
provides curated and annotated information linking 
Single Nucleotide Polymorphisms (SNPs) and individual 
genes. rs144848 is one of the multiple mutations in the BRCA2 
gene that have been reported to represent an independently minor 
but cumulatively significant increased risk for developing breast 
cancer [55]. dbSNP provides information on the SNP's location (e.g., 
chromosome and chromosomal position), source assays, discordant 
genotypes and population diversity (URL 
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs144848). 


Molecular function: UniProt [56] provides information on gene 
transcripts, in this case BRCA2_HUMAN (P51587, Breast cancer 
type 2 susceptibility protein). The biological process and molecular 
functions of the gene product are annotated using the Gene 
Ontology: double-strand break repair via homologous recombination 
(GO:0000724), DNA repair (GO:0006281), cytokinesis 
(GO:0000910), protease binding (GO:0002020), and positive regulation 
of transcription, DNA-templated (GO:0045893). Using 
the GO, researchers are able to identify other gene products that 
share a common biological pathway or molecular function and 
incorporate that knowledge in their experiments (URL: 
http://www.uniprot.org/uniprot/P51587). 


Phenotypic information: The HPO Breast carcinoma term (HP:0003002; 
http://purl.obolibrary.org/obo/HP_0003002) defines 
the presence of a carcinoma of the breast and is a child node of 
Neoplasms of the breast (HP:0100013). The HPO term contains a 
cross-reference to the Unified Medical Language System (UMLS) 
Malignant Neoplasm of Breast concept (UMLS:C0006142; URL 
https://uts.nlm.nih.gov//metathesaurus.html#C0006142;0;1;CUI;2015AA;EXACT_MATCH;*;), 
which in turn provides mappings to 
other major controlled clinical terminologies such as the International 
Classification of Diseases 10th revision (C50, Malignant neoplasm of 
breast; http://apps.who.int/classifications/icd10/browse/2010/en#/C50-C50) 
and SNOMED Clinical Terms (254837009, Malignant tumor of breast; 
http://bioportal.bioontology.org/ontologies/SNOMEDCT?p=classes&conceptid=254837009). 


Clinical phenotype: Oncology data in hospitals are stored in diverse 
locations and formats since diagnosis and treatment is a multidisci- 
plinary process between pathology, radiology, surgery, medical 
oncology and radiotherapy. Breast cancer diagnosis and severity are 
usually evaluated through imaging tests such as mammograms, 
ultrasounds, magnetic resonance imaging or by performing a 
biopsy. Medical images and their associated metadata are stored in 
a picture archiving and communication system (PACS) and 
information about these procedures and the results obtained would 
be recorded using intervention and procedure terms. Diagnosis and 
staging information would be stored and coded in pathology sys- 
tems using a medical terminology such as SNOMED-CT or other 
bespoke data structures. Treatment data would be stored in the 
pharmacy information systems. 
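
The chain of cross-references walked through in this case study can be sketched as a small directed graph and searched for a path from a variant to a clinical code. The identifiers are those quoted in the text, but the dictionary layout, the prefixes, and the direction of each link are simplifications made for illustration; in particular, the link from the Entrez gene record to UniProt is an assumption of this sketch.

```python
from collections import deque
from typing import Dict, List, Optional

# Identifiers quoted in the case study. Layout and link directions are
# assumptions of this sketch, not a format used by any of the resources.
XREFS: Dict[str, List[str]] = {
    "dbSNP:rs144848": ["Entrez:675"],
    "Entrez:675": ["Ensembl:ENSG00000139618", "UniProt:P51587", "OMIM:600185"],
    "UniProt:P51587": ["GO:0000724", "GO:0006281"],
    "OMIM:600185": ["OMIM:114480"],
    "OMIM:114480": ["HP:0003002"],
    "HP:0003002": ["UMLS:C0006142"],
    "UMLS:C0006142": ["ICD10:C50", "SNOMEDCT:254837009"],
}


def find_path(start: str, goal: str,
              xrefs: Dict[str, List[str]]) -> Optional[List[str]]:
    """Breadth-first search for a shortest chain of cross-references."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in xrefs.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None   # no chain of links connects the two identifiers


path = find_path("dbSNP:rs144848", "ICD10:C50", XREFS)
print(" -> ".join(path))
```

Each hop in the returned path corresponds to one annotation mechanism in Fig. 1, which is why a break in any single cross-reference is enough to disconnect the genomic and clinical ends.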


The amount of clinical data that are generated and captured during 
routine clinical care is increasing in size and complexity. Integrating 
clinical data from disparate sources, however, is a challenging task due 
to their lack of common structure and annotation. Similar to the 
Gene Ontology, controlled clinical terminologies have been created 
to facilitate the systematic capture, curation, and description of health 
care related events such as diagnoses, prescriptions and procedures 
from EHR data and enable their subsequent usage for clinical care, 
research, or administrative purposes. Furthermore, linking EHR data 
with biological knowledge is increasingly becoming possible through 
tools such as the Human Phenotype Ontology (HPO) and the 
Disease Ontology that aim to provide the semantic scaffolding for 
computationally integrating biomedical knowledge across sources. 
Part VII 


Conclusion 


Chapter 21 


The Vision and Challenges of the Gene Ontology 


Suzanna E. Lewis 


Abstract 


The overarching goal of the Gene Ontology (GO) Consortium is to provide researchers in biology and 
biomedicine with all current functional information concerning genes and the cellular context under which 
these occur. When the GO was started in the 1990s surprisingly little attention had been given to how 
functional information about genes was to be uniformly captured, structured in a computable form, and 
made accessible to biologists. Because knowledge of gene, protein, ncRNA, and molecular complex roles 
is continuously accumulating and changing, the GO needed to be a dynamic resource, accurately tracking 
ongoing research results over time. Here I describe the progress that has been made over the years towards 
this goal, and the work that still remains to be done, for the Gene Ontology (GO) Consortium to 
realize its goal of offering the most comprehensive and up-to-date resource for information on gene 
function. 


Key words Gene Ontology, Gene function, Genomics, Biological modeling 


1 Motivation 


From their outset in the early 1990s it was obvious that biological 
databases demanded a methodical way of describing the function 
of genes. For one thing, a model system's raison d'être was to gain 
insight into human health and, in the days before entire genomes 
and proteomes were available, the relevant connections to human 
biology were largely based on textual descriptions of biological 
role. In conjunction, as genomes such as yeast were being com- 
pleted, new laboratory techniques were being developed for sur- 
veying the genome, such as microarray expression panels, and these 
data cried out for systematic description of the voluminous results. 
Finally, lest we forget, this period also saw the advent of the “World 
Wide Web.” The early pioneers in biological databases were quick 
to take advantage of the latest technologies for data dissemination 
(much easier than shipping a copy of GenBank on tape or disk 
drive as was the norm), but exchanging data in a rational and efficient 
manner required concomitant syntactic and semantic 
agreement. Those of us building these data resources (including 
Amos Bairoch, Jonathan Bard, David Botstein, Michelle Gwinn, 
Minoru Kanehisa, Stan Letovsky, and Monica Riley) were avidly 
discussing what might be done. Biologists needed a way of making 
some sense of the information we were so diligently collecting 
about genes, both to locate information and to traverse across taxa. 

Specifically one slightly obsessive biologist, Michael Ashburner, 
wanted to classify all fly genes and have the corresponding worm, 
mouse, human, yeast groups use the same classification scheme 
(see ftp://ftp.ebi.ac.uk/pub/databases/edgp/misc/ashburner/ 
fly_function_tree for an early example, and ftp://ftp.geneontol- 
ogy.org/pub/go/www/gene.ontology.discussion.shtml for the 
white paper as it was first publicly presented in 1998). That way, if 
he found a fly gene involved in a particular process, he could then 
ask what genes in other taxa are (thought to be) involved in the 
“same” process, and what insights can be gleaned from its counter- 
part? We needed a way to describe the attributes of gene products 
in a rigorous way that would enable biologists to roam the universe 
of genomes and biology, to explore: temporally and spatially char- 
acteristic expression patterns; the specific (often) cellular compart- 
ment localization where they acted; whether they were constitutive 
parts of particular cellular components and/or complexes; and 
their biochemical or physiological functions and activities. These 
are attributes of genes that are of great interest to all biologists. 
And in an ideal world all biological databases would agree on how 
such information can be made discoverable and comparable. 


2 Desiderata (Principles) Circa 1996-1997 (Banbury & Les Treilles) 


Two seminal workshops were organized in 1996 and 1997 largely 
devoted to discussing the need for agreement among the genomic 
resources on how semantic comparability should be achieved. The 
first of these was sponsored by the Banbury Center (organized by 
M. Ashburner, E. Harlow, P. Karp and J. Witkowski), and the sec- 
ond on building genome databases sponsored by the Fondation 
des Treilles (organized by W.M. Gelbart, and M. Ashburner). 
These meetings set the stage for the Gene Ontology Consortium 
by defining our working definitions and essential principles. 

These axiomatic working definitions begin with “gene product”: 
a physical object, typically associated with a gene or genes indirectly 
through transcription and translation (for proteins), affecting 
some biological process. Such things as proteins, ncRNAs, protein 
complexes, and so forth are all typical functional objects. These were 
the objects to be described. In turn the essential attributes of a gene 
product, namely its function, the process(es) it participates in, and the 
cellular location at which these occur, were also defined: function 
being a capability that a physical gene product carries as a potential, 
describing only what a gene product can do, without necessarily 
specifying where or when this usage actually occurs; process as a 
transformation that has a temporal aspect to it, even if virtually 
instantaneous, accomplished via one or more ordered assemblies of 
functions; and (originally) cellular component as an anatomical 
structure within the cell, a location in which a function or process 
occurs (since expanded to include extracellular space). 

Following agreement on these basic definitions came the ani- 
mated discussions on the desired (and required) characteristics for 
actual operations. 


3 Essentials for the “Ontology” 


The name “Gene Ontology” was originally a jest, but the joke was 
on us, as it turns out GO is indeed an ontology, at least in the 
computational sense, with the primary operational data structure 
now being OWL. Every attribute mandated at the outset has 
proven its worth and remains at the core of the GO. Some of these 
essential criteria are outlined here. 


3.1 Unique IDs 


It was rapidly understood that unique identifiers were essential 
operationally. This allowed the collaborating resources to reference 
the ontology classes (terms) unambiguously and stably. Furthermore, 
by using a semantically meaningless identifier, as opposed to using 
the label as the identifier, we were free to change the label at any 
time, and to display different preferred labels for different communities. 
At the time this was a major difference compared to 
other frame-based systems such as “Ontolingua” or even the Web Ontology 
Language (OWL, although OWL did not exist at the time), 
which used the label (name) as the identifier. 


3.2 Graph Structure 


It was also determined that it would be essential for the GO terms to 
have a graphical relationship to each other, rather than the prevalent 
norm in biology at the time: a flat list of keywords used for tagging. 
In the early, consciously simplistic model that GO began with, there were 
only two relationship types: is_a and part_of. But it was recognized 
even then that more relationships would ultimately be required. 


3.3 Human Readable Definitions and Labels 

The decision to make numerical identifiers the stable GO “object” 
had implications for the human readable labels. In addition, 
rather than attempting to convey all pertinent biological information 
by encoding it directly into the label, human readable definitions 
would provide the definitive meaning. Thus it is the definition, 
not the label, which defines an ontology class in GO. If a label 
changes it means nothing and there are no serious consequences. 
If a definition changes, such that the meaning of the class has 
changed, then this has obvious consequences for any gene product 
that was annotated to the original class. Thus the original class is 
made obsolete, with a reference to the new class as a suggestion, 
and the new class is given a new identifier. 

Another point that often needs to be clarified is that 
GO has nothing to do with nomenclature. The confusion arises 
because we are using (and often have to use) exactly the same 
words to describe both the product and its function. For example, 
“alcohol dehydrogenase” can describe what you can put in an 
Eppendorf tube (the gene product) or it can describe the function 
of this protein. There is, however, a formal difference: a “gene 
product” has (potentially) a many-to-many relationship with a 
“function.” That is to say there are many gene products that have 
the function “alcohol dehydrogenase” (and some of these may 
indeed be encoded by a gene with the name alcohol dehydrogenase, 
but many will not be). Moreover a particular gene product 
may have both functions “alcohol dehydrogenase” and “acetaldehyde 
dismutase” and possibly more. Since GO’s remit is describing 
functions and processes, nomenclature is irrelevant to its purpose. 

Finally, the labels themselves are intended to be familiar to 
researchers using GO. Over the years some unfortunate “standardization” 
efforts have rendered terms non-user-friendly (for example, 
what researchers call a transcription factor is “sequence-specific 
DNA binding RNA polymerase II transcription factor activity” in 
the GO). The consequence is that both annotation and searching 
are made more error-prone and difficult because the familiar term 
that a biologist would instinctively use cannot be quickly located. 
The GO Consortium continues working to rectify these labeling 
issues, both by an effort to use familiar labels and through the judicious 
use of synonyms. 


Multiple synonyms of different flavors are essential for allowing 
GO to deal with colloquialisms, community preferences, abbreviations, 
legacy names, the multiple ways of referring to chemical elements, 
capitalization, and all the possible variations that occur in 
natural language. Because our top priority was communication of 
biological knowledge, we needed GO to accommodate every individual 
researcher by speaking in their particular idiom. 


3.5 Versioning the Ontology and Classes: The History of Changes 

In 2000 we began to maintain a history of the ontology and of 
each term. Comprehensive snapshots of both the ontology and 
the annotations are taken on a monthly basis, enabling progress to 
be quantified and retrospective analyses to be carried out. 
Additionally, from the outset, date stamping and authorship for 
each class were captured. Originally, and currently, the form is 
rather rudimentary: (Modified|Added|Deleted|Split from|Merged 
with) by firstnameorinitial,surname yymmdd. This early decision 
to support “micro-attribution” remains valid, but the form is 
gradually transitioning into a more modern approach through the 
development of online editing and annotation tools with authentication 
and authorization. 


3.6 Slims 

From the outset members of the community were asking for sub- 
sets of the GO containing only the major categories and subcatego- 
ries, or a branch of relevance to their particular application. These 
“Slims” enable the users to broadly group their gene products using 
a very limited set of broad categories, or confine themselves to spe- 
cific branches dealing with a particular biological topic, or constrain 
the GO by a taxonomic criterion. “Slims” are handled internally by 
tagging the different GO classes as members of various categories. 
These GO subsets are used in multiple different ways: for high-level 
classification; for defining sub-branches at the finest granularity; for 
clade specific versions; and other utility subsets. 
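
The tagging mechanism described above can be sketched as follows. Classes carry subset tags, and a specific term is mapped to a slim by walking up the graph and keeping the tagged classes that are encountered. The identifiers, the subset name `toy_slim`, and the choice to keep every tagged ancestor rather than only the nearest ones are assumptions made for this example.

```python
from typing import Dict, List, Set

# Toy ontology: child -> parents. Identifiers are placeholders, not GO IDs.
PARENTS: Dict[str, List[str]] = {
    "X:5": ["X:3"],
    "X:4": ["X:2", "X:3"],
    "X:3": ["X:1"],
    "X:2": ["X:1"],
    "X:1": [],
}
# Subset membership is stored as a tag on each class.
SUBSETS: Dict[str, Set[str]] = {"X:2": {"toy_slim"}, "X:3": {"toy_slim"}}


def map_to_slim(term: str, slim: str) -> Set[str]:
    """Return the slim classes that are the term itself or one of its ancestors."""
    found: Set[str] = set()
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if slim in SUBSETS.get(current, set()):
            found.add(current)
        stack.extend(PARENTS.get(current, []))
    return found


print(sorted(map_to_slim("X:4", "toy_slim")))
```

Storing the subset as a tag on the class, rather than as a separate copy of the ontology, means that a slim stays consistent with the full graph whenever the graph is edited.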


We determined that collaborating databases would be responsible 
for attributing any functional assignment to a source (e.g., a literature 
reference or computational analysis) and for indicating the 
type of evidence used by this attribution source. The initial set of 
“evidence codes” was primed from this short list: 


• Inferred from genetic interaction with 
• Inferred from protein interaction with 
• Inferred from sequence similarity with 
• Inferred from direct assay 


This enabled statements such as “Publication NNN” asserted 
that “gene A” has “function XYZ” by inference from a “direct 
assay.” Since this time evidence codes have developed into an 
autonomous ontology [1] and are discussed in Chap. 18 [2], but the 
principle remains the same: if you are asserting that something is 
true then you must provide the evidence—its general category and 
the published reference—for making this assertion. 
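
The principle that no assertion is accepted without its evidence category and reference can be sketched as a record that refuses to be created otherwise. The class name, the field names, and the short keys used for the four initial categories are assumptions of this example and are not the codes used in GO annotation files.

```python
from dataclasses import dataclass

# The four initial evidence categories listed in the text. The short keys
# are labels chosen for this sketch.
EVIDENCE = {
    "genetic_interaction": "Inferred from genetic interaction with",
    "protein_interaction": "Inferred from protein interaction with",
    "sequence_similarity": "Inferred from sequence similarity with",
    "direct_assay": "Inferred from direct assay",
}


@dataclass(frozen=True)
class Annotation:
    gene_product: str
    go_class: str
    evidence: str    # must be one of the EVIDENCE keys
    reference: str   # the source making the assertion

    def __post_init__(self) -> None:
        if self.evidence not in EVIDENCE:
            raise ValueError(f"unknown evidence category: {self.evidence!r}")
        if not self.reference:
            raise ValueError("an annotation must cite its source")


# The example statement from the text, expressed as a record.
ok = Annotation("gene A", "function XYZ", "direct_assay", "Publication NNN")
print(ok.reference, "asserts", ok.gene_product, "has", ok.go_class,
      "by:", EVIDENCE[ok.evidence])
```

Validating at construction time means that an unattributed assertion cannot enter the data set in the first place, instead of having to be filtered out later.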


GO did not arise from nothing. Like every technology it used what 
came before it. Furthermore, we wanted to give attribution 
to our predecessors and provide a migration path for anyone 
with legacy data that had utilized these prior vocabularies. This 
practice came out of our own need as well. As the ontology was 
being built up we wanted to track some of our original sources. 
Thus references to Monica Riley’s functional categories for E. coli 
[3], Enzyme Commission numbers (EC; http://www.chem.qmul.ac.uk/iubmb/enzyme/) 
and SwissProt keywords (SP; http://www.uniprot.org/docs/keywlist) 
helped to bootstrap GO at the 
outset, and additional cross-references, such as Medical Subject 
Headings (MeSH; https://www.nlm.nih.gov/mesh/) [4], were 
added shortly thereafter to aid in interoperability. 


Another expressivity requirement was to allow assertions stating 
that a given GO class does not hold for a given gene product. 
Experimentalists often test for an expected function, with negative 
results. Rather than lose this information we needed to provide a 
solution that could convey such negative results. Hence we pro- 
vide for qualifiers on the GO annotations. 
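
A negative result recorded through a qualifier has to be handled explicitly by anything that consumes the annotations, as the sketch below illustrates. The record type, the boolean `negated` field, and the toy data are assumptions made for the example rather than the representation used in GO annotation files.

```python
from typing import List, NamedTuple, Set


class QualifiedAnnotation(NamedTuple):
    gene_product: str
    go_class: str
    negated: bool = False   # True records "tested, and found not to hold"


def asserted_classes(gene_product: str,
                     annotations: List[QualifiedAnnotation]) -> Set[str]:
    """Classes asserted for a gene product, skipping negated annotations."""
    return {a.go_class for a in annotations
            if a.gene_product == gene_product and not a.negated}


def excluded_classes(gene_product: str,
                     annotations: List[QualifiedAnnotation]) -> Set[str]:
    """Classes that were tested for a gene product and found not to hold."""
    return {a.go_class for a in annotations
            if a.gene_product == gene_product and a.negated}


# Toy data, invented for this example.
annotations = [
    QualifiedAnnotation("gene A", "C:1"),
    QualifiedAnnotation("gene A", "C:2", negated=True),
]
print(asserted_classes("gene A", annotations),
      excluded_classes("gene A", annotations))
```

A consumer that ignored the qualifier would count "C:2" as a function of gene A, which is the opposite of what the experiment showed.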


Like most of the challenges facing the GO we recognized the need 
for identifying classes that are taxon specific in the very early years 
(1996 or earlier). The solution finally fell into place when the 
taxon-constraint resource and corresponding web service were 
implemented (e.g., http://owlservices.berkeleybop.org/isClassApplicableForTaxon?format=txt&idstyle=obo&id=GO:0005737&taxid=NCBITaxon:131567) [5]. 
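
The kind of question such a service answers can be sketched with two hypothetical constraint tables, one restricting a class to a taxon and one excluding it from a taxon. All identifiers, taxon names, and the two constraint kinds are assumptions made for illustration, and the sketch does not reproduce the behavior of the web service cited above; for brevity it also ignores constraints inherited from ancestor classes.

```python
from typing import Dict, Set

# Toy lineage: taxon -> parent taxon. Names are placeholders.
TAXON_PARENT: Dict[str, str] = {
    "species_a": "clade_x", "species_b": "clade_y",
    "clade_x": "root", "clade_y": "root",
}
# Hypothetical constraints attached directly to classes.
ONLY_IN: Dict[str, str] = {"C:1": "clade_x"}    # class valid only within taxon
NEVER_IN: Dict[str, str] = {"C:2": "clade_y"}   # class invalid within taxon


def lineage(taxon: str) -> Set[str]:
    """Return the taxon together with all of its ancestors."""
    result = {taxon}
    while taxon in TAXON_PARENT:
        taxon = TAXON_PARENT[taxon]
        result.add(taxon)
    return result


def is_applicable(go_class: str, taxon: str) -> bool:
    """Check a class against both constraint tables for the given taxon."""
    taxa = lineage(taxon)
    if go_class in ONLY_IN and ONLY_IN[go_class] not in taxa:
        return False
    if go_class in NEVER_IN and NEVER_IN[go_class] in taxa:
        return False
    return True   # unconstrained classes apply everywhere


print(is_applicable("C:1", "species_a"), is_applicable("C:1", "species_b"))
```

A check of this kind can be run over an annotation set to flag assignments of a class to a gene product from a taxon in which that class cannot occur.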


Following the precept of test early and often, the first annotation 
effort began at SGD in early 1999. Fly genes were already “anno- 
tated” because these were the seeds that GO grew from. The ques- 
tion was how well proto-GO, based on the needs of fly, would 
translate to another, very different, organism. An extremely simple 
tab-delimited annotation format was devised and the dialog began. 
Similarly the first automated pipeline “love-at-first-sight” was 
developed by Mark Yandell in late 1999 [6, 7] to describe the 
genes of the newly completed fly and human genomes. It was 
straightforward inference based on BLAST alignments, but it pro- 
vided a reasonable overview of the landscape. The response to 
these first efforts was overwhelmingly positive and adoption of GO 
very quickly accelerated. 

The GO project remains focused on providing an integrated data 
resource for functional information, both experimental (Chaps. 4 
and 6 [8, 9]) and predicted (Chap. 5 [10]), for all known proteins, 
noncoding RNA sequences, and cellular components. In other 
words, carrying out comprehensive functional annotation is what 
drives the project, not the ontology itself. The ontology provides the 
biological model that serves as the conceptual scaffolding for the bio- 
logical data. The Gene Ontology database contains currently over 
5.2 million function annotations for almost 900,000 gene products 
(mostly proteins but also some noncoding RNAs). About 660,000 of 
these annotations are based on experimental results reported in the 
published literature, and the remainder are predictions derived from 
a variety of different methods, all of which are freely available for the 
community to use. That said, there is still considerable room for 
improvement. There was, and remains, a significant amount of accu- 
mulated knowledge to be captured. In particular for human, the 
annotation task is still more about capturing old data than capturing 
new data because an equivalent to a Model Organism Database does 
not exist. Until the day the GO catches up it will need to capture 
existing data in parallel with capturing more recent data to achieve 
the coverage it aims for. 
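
The split between experimentally supported and predicted annotations can be tallied directly from an annotation file. The Python sketch below is only an illustration under stated assumptions: it assumes the tab-delimited GAF 2.x column layout (evidence code in column 7, comment lines starting with "!") and treats the codes EXP, IDA, IPI, IMP, IGI, and IEP as experimental. Neither the format details nor the sample rows come from this chapter, and every other code is simply lumped together as "predicted" to mirror the two-way split in the text.

```python
from collections import Counter

# Assumption: evidence codes counted as experimental (not listed in this chapter).
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def tally_evidence(lines):
    """Count experimental vs. predicted annotations in GAF-like lines."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and header/comment lines, which start with "!".
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        evidence = fields[6]  # column 7 in the assumed GAF layout
        kind = "experimental" if evidence in EXPERIMENTAL_CODES else "predicted"
        counts[kind] += 1
    return counts


# Hypothetical rows made up for illustration; only the first nine columns are shown.
SAMPLE = [
    "!gaf-version: 2.1",
    "DB\tGENE1\tgene1\t\tGO:0005737\tPMID:0000000\tIDA\t\tC",
    "DB\tGENE2\tgene2\t\tGO:0005737\tGO_REF:0000002\tIEA\t\tC",
    "DB\tGENE3\tgene3\t\tGO:0006006\tPMID:0000001\tIMP\t\tP",
]

counts = tally_evidence(SAMPLE)
print(counts["experimental"], counts["predicted"])
```

With the figures quoted in the text, the same ratio works out to roughly 660,000 / 5,200,000, or about 13% of annotations being experimentally supported.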


5 Where We Stand Today


Based on the wide adoption by the community, we can claim that
the project met a real need. The GO is a useful alternative to simple
nomenclature, as nomenclature alone is too limited to convey the
biology or to describe protein roles fully. There is still a long
way ahead: several of the key elements that we recognized as essential
in the nineties are still works in progress today.


5.1 Multiple Relationship Types and Relations Between BP, MF, and CC


In 1999 we decided at the first official GO meeting against implementing
relationships across the three branches of the GO until a
later time. Needless to say, this drastically oversimplified the biological
model, a simplification we were fully cognizant of but one
that allowed us to prioritize our work. In the simplistic model with
which GO began there were only two relationship types: is_a and
part_of. And even here the meaning of part_of was conflated, since
part_of in the cellular component branch of GO meant a
sub-component, while part_of in BP meant a step or sub-process.
Since that time we have continued to work on enriching the
Relations Ontology and applying it appropriately
(https://github.com/oborel/obo-relations). Currently there are eight
relationships in use. Most significantly, the three branches of the GO
are now being linked.


5.2 Orthogonality


We did not and do not want multiple “rival” ontologies for one 
domain. The initial necessity for embedding terms within other 
terms led to the creation of numerous implicit ontologies embed- 
ded within the GO (chemicals, anatomical parts, tissues, and cell 
types). In the early years, while we recognized that this might be 
dealt with by incorporating the unique identifier that refers to the 
full definition elsewhere, in practice this could not be reliably 
accomplished at that time and it is taking some time to remedy. 
Work to rectify the situation began shortly after the turn of the 
century [11] and has given rise to a small set of core ontologies, 
which have been teased out of the GO and replaced by including 
the unique identifier for the new class as part of the logical defini- 
tion of the GO class. The first exercise was replacing all implicit 
references to chemicals in the GO with explicit references to ChEBI 
classes [12]. Similarly the Cell Type ontology was derived from the 
GO [13-15] and, as an autonomous ontology, has proven its own 
value for other applications. Expression analyses and RNASeq 
experiments often draw their samples from particular cell types and 
projects such as ENCODE [16] and FANTOM [17] are using the 
cell type ontology to indicate the source cell type for their data. In 
addition, there are coordinated efforts connecting the cell line 
ontology, used in cancer studies, to the cell type ontology to indicate
the original cell type [18]. There is immense benefit to constructing
any ontology from its most elemental components because
doing so provides a connective route across the widest possible
network of projects. For example, RNA expression data from a cancer
study that used a particular cell line can be automatically connected
to ENCODE RNA expression data from a normal cell type.

As regards anatomy, Jonathan Bard initially raised the question 
of how we might consider a common language for anatomy. It was 
clear that we needed a methodology for anatomical interoperability 
and querying data across our various organisms, not just for gene 
function, but ultimately for phenotypes as well. As with chemicals 
and cell types, a species-neutral anatomical ontology was extracted 
from GO, but also incorporated existing anatomical ontologies 
(e.g., mouse, zebrafish, fossils) thereby creating bridges between 
them [19-21]. Beyond its use by GO, this ontology (Uberon) is connecting
phenotype data, for example, from human (it is used for the logical
definitions of the Human Phenotype Ontology) to mouse (likewise there
are logical definitions underlying the Mouse Phenotype ontology)
with direct applicability to human health research [22].

The challenge of comparability and interoperability can largely 
be overcome by community adoption of a small set of standard 
elemental core ontologies, from which special purpose ontologies, 
which meet the unique needs of a given project, can be constructed. 
It is hard to emphasize this enough. While the community seems
to be blooming with a cacophony of idiosyncratic “ontologies,” the
GO is actively working to reduce the proliferation by deconstructing
its terms into the elemental core set of conceptual classes
needed to define its complex terms. This approach is producing 
enormous dividends in terms of interoperability and comparability 
across widely divergent data sets. 
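
To make the idea of replacing an embedded term with an external identifier more concrete, the following Python sketch models a logical definition as a parent class plus a relation to a class owned by another ontology. It is a simplified illustration, not the GO's actual data model: the LogicalDefinition type, the relation label, and the specific identifiers are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalDefinition:
    """A class defined as: a <genus> that <relation> <filler>."""

    defined_class: str
    genus: str
    relation: str
    filler: str

    def external_ontology(self) -> str:
        # The prefix before the colon names the ontology that owns the filler.
        return self.filler.split(":", 1)[0]


# Illustrative identifiers; treat them as assumptions, not authoritative.
glucose_metabolism = LogicalDefinition(
    defined_class="GO:0006006",  # glucose metabolic process
    genus="GO:0008152",          # metabolic process
    relation="has_participant",  # hypothetical relation label
    filler="CHEBI:17234",        # glucose, defined in ChEBI rather than in GO
)


def classes_referencing(definitions, prefix):
    """List defined classes whose filler comes from the given ontology."""
    return [d.defined_class for d in definitions if d.external_ontology() == prefix]


print(classes_referencing([glucose_metabolism], "CHEBI"))
```

Because the chemical is referenced by identifier rather than embedded in the term name, any other resource that uses the same ChEBI class can be joined to the GO class through that shared identifier.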


The context in which a function is carried out was recognized from
the outset as crucial. For example, the role of glucagon-mediated
signal transduction in liver concerns gluconeogenesis, glycogenolysis,
and plasma glucose homeostasis, whereas the role of this process
in adipose tissue is lipolysis. At the level of gene products, the
role of cytochrome C is in oxidative phosphorylation and energy
supply (when it is in the mitochondrion), and apoptosis (when it is
in the cytoplasm). Capturing context has proven operationally (that
is, in terms of how easy it is for someone to annotate) to be one of
our biggest challenges (see Chap. 17 [23] on annotation extensions).
While annotation extensions have given curators a great deal more
expressivity, the approach can still be improved upon, and developing
new annotation strategies and methods is an area where GO is
actively working.


6 What Lies Ahead


The fundamental motivation driving the GO has remained
unchanged: we are attempting to build a realistic model of biology
to enable research, based on the collective evidence gathered by
the research community. As originally envisioned, we needed a way
to describe the attributes of gene products in a rigorous way that
would enable biologists to explore the universe of genomes and
biology. As described above, we were cognizant of the challenges
from the outset, and we are addressing them incrementally, taking
advantage of technological advances as we go.

That said, the GO is predicated upon a reliable foundation of
“annotation.” To gather accumulated knowledge as well as keep
up with new research requires us to continue to seek new, more
efficient approaches for biologists to provide their data. This is
one of our current big challenges. One approach is collaborative
data exchange with other annotation initiatives. For example,
our collaborations with Reactome (http://www.reactome.org/)
and IntAct (http://www.ebi.ac.uk/intact/) allow data from
these resources to be incorporated into GO. Another key strategy
is community annotation, such as described in Chap. 7 [24],
which has provided GO with additional annotations. Our future
plans are to provide online community annotation tools, which
will also be used by GO Consortium curators. These tools will
also support refinement of the GO itself in addition to providing
annotations.


6.1 Phylogenetic Annotation


Providing a resource that captures functional data for every extant 
protein is, to say the least, a formidable challenge. One obvious 
reason is that most sequences are not, nor ever will be, experimen- 
tally characterized (and not just because of volume, but also 
because some are experimentally intractable). Therefore most 
annotations must necessarily be based on predictions. Furthermore, 
for inferences to be as accurate as possible they should be predicated
on an explicit evolutionary framework. For the past several
years a small group of GO curators has been using an annotation
tool, the Phylogenetic Annotation and INference Tool (PAINT) [25],
to infer annotations among members of a protein family. PAINT
allows curators to make precise assertions as to when functions 
were gained and lost during evolution and record the evidence 
(i.e., the experimentally supported GO annotations from the leaves
of the tree and their phylogenetic relationship to an ancestral protein)
for those assertions. PAINT is as yet a stand-alone desktop
application, but work is underway to incorporate it into a suite of
integrated, online annotation tools for GO curators and community
contributors. Among the other tools in current development
is one based on biological modules.


Biological systems are modular at many levels. For example, within a
single domain a catalytic site may be coupled to an (allosteric) binding
site that regulates the catalytic activity. Or, within a single protein,
different domains may form a module, e.g., the ligand binding
domain and protein kinase domain of a transmembrane protein
kinase receptor. Further up the size scale are functional modules
composed of subunits within a macromolecular complex (e.g., the
ribosome). And, at an even higher level, molecular interactions can
define a pathway that can be used or reused in multiple different
processes (e.g., the ubiquitin-dependent proteolysis pathway or
JAK-STAT pathway). The goal of this modular approach is to define
each GO term through a combination of terms, and enable extensible
representation of biological modularity: how elemental molecular
interactions are combined in different ways to produce
compound molecular functions, how molecular functions are combined
to produce processes, and how processes are combined to
produce larger processes. A first release of this curation tool (dubbed
“Noctua”*) is now being evaluated by GO curators. One notable
feature of this new tool is that it combines the tasks of annotation
and ontology construction. Historically the artificial disconnect
between these two inseparable tasks created serious bottlenecks, as
annotators were forced to wait for a separate group to create or
modify requisite terms. With Noctua the curators will be more directly
describing biology, with known relationships in the ontology associated
with specific instances that support this model.


7 Summary


The goal of the Gene Ontology (GO) project is to provide a uniform 
way to describe the functions of gene products from organisms across 
all kingdoms of life and thereby enable analysis of genomic data. It is 
an ongoing enterprise as our understanding of biology grows and is 
refined. It is a computational model of biological reality that we ulti- 
mately hope every researcher will happily contribute to and regard as 
the optimum means of sharing the knowledge they have gained from 
their own research with the wider community. 


*The little owl (Athene noctua) is a bird that was sacred to Athena,
the Greek goddess of wisdom.